Network Working Group                                         J. Klensin
Internet-Draft
Updates: 5892, 5894 (if approved)                           P. Faltstrom
Intended status: Standards Track                                  Netnod
Expires: April 11, 2018                                  October 8, 2017


             IDNA Update for Unicode 7.0 and Later Versions
                draft-klensin-idna-5892upd-unicode70-05

Abstract

   The current version of the IDNA specifications anticipated that each
   new version of Unicode would be reviewed to verify that no changes
   had been introduced that required adjustments to the set of rules
   and, in particular, whether new exceptions or backward compatibility
   adjustments were needed. The review for Unicode 7.0.0 first
   identified a potentially problematic new code point and then a much
   more general and difficult issue with Unicode normalization. This
   specification discusses those issues and proposes updates to IDNA
   and, potentially, the way the IETF handles comparison of identifiers
   more generally, especially when there is no associated language or
   language identification. It also applies an editorial clarification
   to RFC 5892 that was the subject of an earlier erratum and updates
   RFC 5894 to point to the issues involved.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 11, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Origins and Discovery of the Issue
     1.2. IDNA2008 and Special or Exceptional Cases
     1.3. Terminology
   2. Document Aspirations
   3. Problem Description
     3.1. IDNA assumptions about Unicode normalization
     3.2. The discovery and the Arabic script cases
       3.2.1. New code point U+08A1, decomposition, and language
              dependency
       3.2.2. Other examples of the same behavior within the Arabic
              Script
       3.2.3. Hamza and Combining Sequences
     3.3. Precomposed characters without decompositions more
          generally
       3.3.1. Description of the general problem
       3.3.2. Latin Examples and Cases
         3.3.2.1. The font exclusion and compatibility relationships
         3.3.2.2. The phonetic notation characters and extensions
         3.3.2.3. The stroke (solidus) ambiguity
           3.3.2.3.1. Combining dots and other shapes combine...
                      unless...
           3.3.2.3.2. "Legacy" characters and new additions
       3.3.3. Unexpected Combining Sequences
       3.3.4. Examples and Cases from Other Scripts
         3.3.4.1. Scripts with precomposed preferences and ones
                  with combining preferences
         3.3.4.2. The Han and Kangxi Cases
     3.4. Confusion and the Casual User
   4. Implementation options and issues: Unicode properties,
      exceptions, and the nature of stability
     4.1. Unicode Stability compared to IETF (and ICANN) Stability
     4.2. New Unicode Properties
     4.3. The need for exception lists
   5. Proposed/Alternative Changes to RFC 5892 for the issues
      first exposed by new code point U+08A1
     5.1. Disallow This New Code Point
     5.2. Disallow This New Code Point and All Future Precomposed
          Additions that Do Not Decompose
     5.3. Disallow the combining sequences for these characters
     5.4. Use Combining Classes to Develop Additional Contextual
          Rules
     5.5. Disallow all Combining Characters for Specific Scripts
     5.6. Do Nothing Other Than Warn
     5.7. Normalization Form IETF (NFI)
   6. Editorial clarification to RFC 5892
   7. Acknowledgements
   8. IANA Considerations
   9. Security Considerations
   10. References
     10.1. Normative References
     10.2. Informative References
   Appendix A. Change Log
     A.1. Changes from version -00 (2014-07-21) to -01
     A.2. Changes from version -01 (2014-12-07) to -02
     A.3. Changes from version -02 (2014-12-07) to -03
     A.4. Changes from version -03 (2015-01-06) to -04
     A.5. Changes from version -04 (2015-03-11) to -05
   Authors' Addresses

1. Introduction

      Note in/about -04 and -05 Drafts: These two versions of the
      document contain a very large amount of new material as compared
      to the -03 version. The new material reflects an evolution of
      community understanding in the first quarter of 2015 and further
      evolution between then and mid-2017 from an assumption that the
      problem involved only a few code points and one combining
      character in a single script (Hamza Above and Arabic) to an
      understanding that the problem we have come to call
      "non-decomposing code points" and several closely related ones
      are quite pervasive and may represent fundamental
      misunderstandings or omissions from IDNA2008 (and, by extension,
      the basics of PRECIS [RFC8264]) that must be corrected if those
      protocols are going to be used in a way that supports
      internationalized identifiers on the Internet predictably (as
      seen by the end user) and securely.

      This version is still necessarily incomplete: not only is our
      understanding probably still not comprehensive, but there are a
      number of placeholders for text and references. Nonetheless, the
      document in its current form should be useful as both the
      beginning of a comprehensive overview of the issues and a source
      of references to other relevant materials.

      This draft could almost certainly be better organized to improve
      its readability: specific suggestions would be welcome.

1.1. Origins and Discovery of the Issue

   The current version of the IDNA specifications, known as "IDNA2008"
   [RFC5890], anticipated that each new version of Unicode would be
   reviewed to verify that no changes had been introduced that required
   adjustments to IDNA's rules and, in particular, whether new
   exceptions or backward compatibility adjustments were needed. When
   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
   comparing it to prior versions including the text in Unicode 6.2
   [Unicode62], it identified a problematic new code point (U+08A1,
   ARABIC LETTER BEH WITH HAMZA ABOVE). The code point was added for
   Arabic Script use with the Fula (also known as Fulfulde, Pulaar, and
   Pular'Fulaare) language. That language is apparently most often
   written in Latin characters today [Omniglot-Fula] [Dalby] [Daniels].

   The specific problem is discussed in detail in Section 3. In very
   broad terms, IDNA (and other IETF work) assume that, if one can
   represent "the same character" either as a combining sequence or as a
   single code point, strings that are identical except for those
   alternate forms will compare equal after normalization. Part of the
   difficulty that has characterized this discussion is that "the same"
   differs depending on the criteria that are chosen. It may be further
   complicated in practice by differences in preferred type styles or
   rendering, but Unicode code point choices are not supposed to depend
   on type style (font) variations and, again, IDNA has no mechanism for
   specifying language choices that might affect rendering.

   The behavior of the newly-added code point, while non-optimal for
   IDNA, follows that of a few code points that predate Unicode 7.x and
   even the IDNA2008 specifications and Unicode 6.0.
   Those existing code points, which may not be easy to characterize
   accurately as a group, make the question of what, if anything, to
   do about this new one exceedingly problematic, as is, perhaps
   separately, the question of what to do about existing sets of code
   points with the same behavior, because different reasonable
   criteria yield different decisions. Specifically:

   o  To disallow it (and future, but not existing, characters with
      similar characteristics) as an IDNA exception case creates
      inconsistencies with how those earlier code points were handled.

   o  To disallow it and the similar code points as well would
      necessitate invalidating some potential labels that would have
      been valid under IDNA2008 until this time. Depending on how the
      collection of similar code points is characterized, a few of them
      are almost certainly used in reasonable labels.

   o  To permit the new code point to be treated as PVALID creates a
      situation in which it is possible, within the same script, to
      compose the same character symbol (glyph or grapheme) in two
      different ways that do not compare equal even after normalization.
      That condition would then apply to it and the earlier code points
      with the same behavior. That situation contradicts a fundamental
      assumption of IDNA that is discussed in more detail below.

   NOTE IN DRAFT:

      This working draft discusses six alternatives, including an idea
      (an IETF-specific normalization form) that seemed too drastic to
      be considered when IDNA2008 was designed or even when the review
      of Unicode 7.0 for IDNA purposes began. In retrospect, it not
      only would have been appropriate to discuss when the IDNA2008
      specifications were being developed but also appears more
      attractive now.
      The authors suggest that the community discuss
      the relevant tradeoffs and make a decision and that the document
      then be revised to reflect that decision, with the other
      alternatives discussed as options not chosen. Because there is no
      ideal choice, the discussion of the issues in Section 3 is
      probably as important as, or more important than, the particular
      choice of how to handle this code point. In addition to providing
      information for this document, that section should be considered
      as an updating addendum to RFC 5894 [RFC5894] and should be
      incorporated into any future revision of that document.

      Because this version of the document contains several alternate
      proposals, some of the text is also somewhat redundant. That will
      be corrected in future versions.

1.2. IDNA2008 and Special or Exceptional Cases

   IDNA2008 contains several types of explicit provisions for
   characters (code points) that require special treatment when the
   requirements of the DNS cannot easily be met by calculations based
   on stable Unicode properties. Those provisions are
   [[CREF1: ... to be supplied]]

   As anticipated when IDNA2008, and RFC 5892 in particular, were
   written, exceptions and explicit updates are likely to be needed only
   if there is disagreement between the Unicode Consortium's view about
   what is best for the Standard and its very diverse user community and
   the IETF's view of what is best for IDNs, the DNS, and IDNA. It was
   hoped that a situation would never arise in which the two
   perspectives would disagree, but the possibility was anticipated and
   considerable mechanism was added to RFC 5890 and RFC 5892 as a
   result. It is probably important to note that a disagreement in this
   context does not imply that anyone is "wrong", only that the two
   different groups have different needs and therefore different
   criteria about what is acceptable.
   In particular, it appears that the Unicode Consortium has made
   assumptions about the availability (by explicit designation or
   context) of information about applicable languages or other context
   for a given string that are not possible for IDNA. For that reason,
   the IETF has, in the past, allowed some characters for IDNA that
   active Unicode Technical Committee members suggested be disallowed to
   avoid a change in derived tables [RFC6452]. This document describes
   a set of cases for which the IETF must consider disallowing sets of
   characters that the various properties would otherwise treat as
   PVALID.

   This document provides the "flagging for the IESG" specified by
   Section 5.1 of RFC 5892. As specified there, the change itself
   requires IETF review because it alters the rules of Section 2 of that
   document.

   [[RFC Editor: please remove the following comment and note if they
   get to you.]]

   [[IESG: It might not be a bad idea to incorporate some version of
   the following into the Last Call announcement.]]

   NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
      particularly the choices among options for either adding exception
      cases to RFC 5892 or ignoring the issue, warning people, and
      hoping the results do not include or enable serious problems, are
      fairly esoteric. Understanding them requires that one have at
      least some understanding of how scripts in which precomposed
      characters are preferred over combining sequences as a Unicode
      design and extension principle work. Those scripts include Arabic
      but, unlike the assumption when the issues were first discovered,
      are by no means limited to it. Readers should also understand the
      reasons the Unicode Standard gives various Arabic Script
      characters a fairly extended discussion [Unicode70-Arabic] but
      should treat that only as an example and note that most other
      cases are much less well documented.
      It also requires
      understanding of a number of Unicode principles, including the
      Normalization Stability rules [UAX15-Versioning] as applied to new
      precomposed characters and guidelines for adding new characters.
      There is considerable discussion of the issues in Section 3 and
      references are provided for those who want to pursue them, but
      potential reviewers should assume that the background needed to
      understand the reasons for this change is no less deep in the
      subject matter than would be expected of someone reviewing a
      proposed change in, e.g., the fundamentals of BGP, TCP congestion
      control, or some cryptographic algorithm. Put more bluntly, one's
      ability to read or speak languages other than English, or even one
      or more languages that use the Arabic script or other scripts
      similarly affected, does not make one an expert in these matters.

1.3. Terminology

   This document assumes that the reader is reasonably familiar with the
   terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
   IETF conventions for representing Unicode code points [RFC5137].
   Some terms used here may not be used in the same way in those two
   sets of documents. From one point of view, those differences may
   have been the results of, or led to, misunderstandings that may, in
   turn, be part of the root cause of the problems explored in this
   document. In particular, this document uses the term "precomposed
   character" to describe characters that could reasonably be composed
   by a combining sequence using code points with appropriate appearance
   in common type styles but for which a single code point that does not
   require combining sequences is available. That definition is
   strictly about mechanical composition and does not involve any
   considerations about how the character is used. It is closely
   related to this document's definition of "identical".
   When a
   precomposed character exists and either applying NFC to the combining
   sequence does not yield that character or applying NFD to that
   character's code point does not yield the combining sequence, it is
   referred to in this document as "non-decomposable".

   The document also uses some terms that are familiar to those who have
   been involved with IDNs and IDNA for a long time, but uses them more
   precisely than may be common in other quarters. For example, the
   term "Punycode" is not used at all in the rest of this document
   because it is the name of a very specific encoding algorithm
   [RFC3492] that does not incorporate the rules and algorithms for
   domain name labels that are produced by that encoding. Instead, the
   generic terms "ACE" or "ACE string", for "ASCII-compatible
   encoding", are used to refer to strings that abstractly contain
   characters outside the ASCII repertoire [RFC0020] but are encoded so
   that only ASCII characters appear in the string that would be
   encountered by a user or protocol. The terms "A-label" and
   "U-label", as defined in RFC 5890, are used to refer, respectively,
   to the ACE form and to the more conventional (or "native") character
   form in which those non-ASCII characters appear in conventional
   Unicode encodings (typically UTF-8).

2. Document Aspirations

   This document, in its present form, is not a proposal for a solution.
   Instead, it is intended to be (or evolve into) a comprehensive
   description of the issues and problems and to outline some possible
   approaches to a solution.
   A perfect solution -- one that would
   resolve all of the issues identified in this document -- would
   involve a relatively small set of relatively simple rules and hence
   would be comprehensible and predictable for and by non-expert end
   users, would not require code point by code point or even block by
   block exception lists, and would not leave users of any script or
   language feeling that their particular writing system has been
   treated less fairly than others.

   Part of the reality we need to accept is that IDNA, in its present
   form, represents compromises that do not completely satisfy those
   criteria and whatever is done about these issues will probably make
   it (or the job of administering zones containing IDNs) more complex.
   Similarly, as the Unicode Standard suggests when it identifies ten
   Design Principles and the text then says "Not all of these principles
   can be satisfied simultaneously..." [Unicode70-Design], while there
   are guidelines and principles, a certain amount of subjective
   judgment is involved in making determinations about normalization,
   decomposition, and some property values. For Unicode itself, those
   issues are resolved by multiple statements (at least one cited below)
   that one needs to rely on per-code point information in the Unicode
   Character Database rather than on rules or principles. The design of
   IDNA and the effort to keep it largely independent of Unicode
   versions require rules, categories, and principles that can be
   relied upon and applied algorithmically. There is obviously some
   tension between the two approaches.

3. Problem Description

3.1. IDNA assumptions about Unicode normalization

   IDNA makes several assumptions about Unicode, Unicode "characters",
   and the effects of normalization.
   Those assumptions were based on
   careful reading of the Unicode Standard at the time [Unicode5],
   guided by advice and commitments by members of the Unicode Technical
   Committee. Those assumptions, and the associated requirements, are
   necessitated by three properties of DNS labels that typically do not
   apply to blocks of running text:

   1.  There is no language context for a label. While particular DNS
       zones may impose restrictions, including language or script
       restrictions, on what labels can be registered, neither the DNS
       nor IDNA imposes either type of restriction or gives the user of
       a label any indication about the registration or other
       restrictions that may have been imposed.

   2.  Labels are often mnemonics rather than words in any language.
       They may be abbreviations or acronyms or contain embedded digits
       and have other characteristics that are not typical of words.

   3.  Labels are, in practice, usually short. Even when they are the
       maximum length allowed by the DNS and IDNA, they are typically
       too short to provide significant context. Statements that
       suggest that languages can almost always be determined from
       relatively short paragraphs or equivalent bodies of text do not
       apply to DNS labels because of their typical short length and
       because, as noted above, they are not required to be formed
       according to language-based rules.

   At the same time, because the DNS is an exact-match system, there
   must be no ambiguity about whether two labels are equal. Although
   there have been extensive discussions about "confusingly similar"
   characters, labels, and strings, such tests between scripts are
   always somewhat subjective: they are affected by choices of type
   styles and by what the user expects to see.
   In spite of the fact
   that the glyphs that represent many characters in different scripts
   are identical in appearance (e.g., basic Latin "a" (U+0061) and the
   identical-appearing Cyrillic character (U+0430)), the most important
   test is that, if two glyphs are the same within a given script, they
   must represent the same character no matter how they are formed.

   Unicode normalization, as explained in [UAX15], is expected to
   resolve those "same script, same glyph, different formation methods"
   issues. Within the Latin script, the code point sequence for lower
   case "o" (U+006F) and combining diaeresis (U+0308) will, when
   normalized using the "NFC" method required by IDNA, produce the
   precomposed small letter o with diaeresis (U+00F6) and hence the two
   ways of forming the character will compare equal (and the combining
   sequence is effectively prohibited from U-labels).

   NFC was preferred over other normalization methods for IDNA because
   it is more compact, more likely to be produced on keyboards on which
   the relevant characters actually appeared, and because it does not
   lose substantive information (e.g., some types of compatibility
   equivalence involve judgment calls as to whether two characters are
   actually the same -- they may be "the same" in some contexts but not
   others -- while canonical equivalence is about different ways to
   produce the glyph for the same abstract character).

   IDNA also assumed that the extensive Unicode stability rules would be
   applied and work as specified when new code points were added. Those
   rules, as described in The Unicode Standard and the normative annexes
   identified below, provide that:

   1.  New code points representing precomposed characters that can be
       formed from combining sequences will not be added to Unicode
       unless neither the relevant base character nor required combining
       character(s) are part of the Standard within the relevant script
       [UAX15-Versioning].

   2.  If circumstances require that principle be violated,
       normalization stability requires that the newly-added character
       decompose (even under NFC) to the previously-available combining
       sequence [UAX15-Exclusion].

   At least at the time IDNA2008 was being developed, there was no
   explicit provision in the Standard's discussion of conditions for
   adding new code points, nor of normalization stability, for an
   exception based on different languages using the same script or
   ambiguities about the shape or positioning of combining characters.

3.2. The discovery and the Arabic script cases

   While the set of problems with normalization discussed above was
   discovered with a newly-added code point for the Arabic Script and
   some characteristics of Unicode handling of that script seem to make
   the problem more complex going forward, these are not issues specific
   to Arabic. This section describes the Arabic-specific problems;
   subsequent ones (starting with Section 3.3) discuss the problem more
   generally and include illustrations from other scripts.

3.2.1. New code point U+08A1, decomposition, and language dependency

   Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
   WITH HAMZA ABOVE. As can be deduced from the name, it is visually
   identical to the glyph that can be formed from a combining sequence
   consisting of the code point for ARABIC LETTER BEH (U+0628) and the
   code point for Combining Hamza Above (U+0654).
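
   The contrast between the Latin case in Section 3.1 and this Arabic
   case can be observed with a short, illustrative sketch. It is not
   part of this specification. It assumes Python's standard
   "unicodedata" module and an interpreter whose Unicode database is at
   version 7.0 or later; the helper name nfc_equal is hypothetical and
   is introduced only for this example.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin: "o" (U+006F) followed by COMBINING DIAERESIS (U+0308),
# compared with the precomposed U+00F6.
latin_sequence = "\u006F\u0308"
latin_precomposed = "\u00F6"

# Arabic: BEH (U+0628) followed by HAMZA ABOVE (U+0654),
# compared with the precomposed U+08A1.
arabic_sequence = "\u0628\u0654"
arabic_precomposed = "\u08A1"

latin_match = nfc_equal(latin_sequence, latin_precomposed)
arabic_match = nfc_equal(arabic_sequence, arabic_precomposed)

# unicodedata.decomposition() returns an empty string when the code
# point has no decomposition mapping in the Unicode Character Database.
arabic_decomposition = unicodedata.decomposition(arabic_precomposed)

print("Latin forms equal after NFC: ", latin_match)
print("Arabic forms equal after NFC:", arabic_match)
print("Decomposition of U+08A1:     ", repr(arabic_decomposition))
```

   Under those assumptions, the Latin pair is expected to compare equal
   after NFC while the Arabic pair does not, because U+08A1 carries no
   decomposition mapping. Results depend on the Unicode version bundled
   with the interpreter in use.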
   The two rules
   summarized above (see the last part of Section 3.1) suggest that
   either the new code point should not be allocated at all or that it
   should have a decomposition to \u'0628'\u'0654'.

   Had the issues outlined in this document been better understood at
   the time, it probably would have been wise for RFC 5892 to disallow
   either the precomposed character or the combining sequence of each
   pair in those cases in which Unicode normalization rules do not cause
   the right thing to happen, i.e., the combining sequence and
   precomposed character to be treated as equivalent. Failure to do so
   at the time places an extra burden on registries to be sure that
   conflicts (and the potential for confusion and attacks) do not exist.
   Oddly, had the exclusion been made part of the specification at that
   time, the preference for precomposed forms noted above would probably
   have dictated excluding the combining sequence, something not
   otherwise done in IDNA2008 because the NFC requirement serves the
   same purpose. Today, the only thing that can be excluded without the
   potential disruption of disallowing a previously-PVALID combining
   sequence is the newly-added code point, so whatever is done, or
   might have been contemplated with hindsight, will be somewhat
   inconsistent.

3.2.2. Other examples of the same behavior within the Arabic Script

   One of the things that complicates the issue with the new U+08A1 code
   point is that there are several other Arabic-script code points that
   behave in the same way for similar language-specific reasons.

   In particular, at least three other grapheme clusters that have been
   present for many versions of Unicode can be seen as involving issues
   similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
   ABOVE.
ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER 499 REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are 500 preferred over combining sequences using HAMZA ABOVE (U+0654) 501 [Unicode70-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE 502 (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH 503 HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC 504 LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' 505 so the precomposed character and combining sequences compare equal 506 when both are normalized, as this specification prefers. 508 There are other variations in which a precomposed character involving 509 HAMZA ABOVE has a decomposition to a combining sequence that can form 510 it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a 511 compatibility decomposition, but not a canonical one, into the 512 combining sequence \u'06C7'\u'0674'. 514 3.2.3. Hamza and Combining Sequences 516 As the Unicode Standard points out at some length [Unicode70-Arabic], 517 Hamza is a problematic abstract character and the "Hamza Above" 518 construction even more so [Unicode70-Hamza]. Those sections explain 519 a distinction made by Unicode between the use of a Hamza mark to 520 denote a glottal stop and one used as a diacritic mark to denote a 521 separate letter. In the first case, the combining sequence is used. 522 In the second, a precomposed character is assigned. 524 Unlike Unicode generally and because of concerns about identifier 525 spoofing and attacks based on similarities, character distinctions in 526 IDNA are based much more strictly on the appearance of characters; 527 language and pronunciation distinctions within a script are not 528 considered.
So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- 529 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them 530 is written as U+08A1 (new to Unicode 7.0.0) and the other as the 531 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also 532 available in versions of Unicode going back at least to the version 533 [Unicode32] used in the original version of IDNA [RFC3490]). Because 534 the precomposed form and combining sequence are, for IDNA purposes, 535 the same, IDNA expects that normalization (specifically the 536 requirement that all U-labels be in NFC form) will cause them to 537 compare equal. 539 If Unicode also considered them the same, then the principle would 540 apply that new precomposed ("composition") forms are not added unless 541 one of the code points that could be used to construct it did not 542 exist in an earlier version (and even then is discouraged) 543 [UAX15-Versioning]. When exceptions are made, they are expected to 544 conform to the rules and classes in the "Composition Exclusion 545 Table", with class 2 being relevant to this case [UAX15-Exclusion]. 546 That rule essentially requires that the normalization for the old 547 combining sequence to itself be retained (for stability) but that the 548 newly-added character be treated as canonically decomposable and 549 decompose back to the older sequence even under NFC. That was not 550 done for this particular case, presumably because of the distinction 551 about pronunciation modifiers versus separate letters noted above. 552 Because, for IDNA and the DNS, there is a possibility that the 553 composing sequence \u'0628'\u'0654' already appears in labels, the 554 only choice other than allowing an otherwise-identical, and 555 identically-appearing, label with U+08A1 substituted to identify a 556 different DNS entry is to DISALLOW the new character. 558 3.3. Precomposed characters without decompositions more generally 560 3.3.1.
Description of the general problem 562 As mentioned above, IDNA made a strong assumption that, if there were 563 two ways to form the same abstract character in the same script, 564 normalization would result in them comparing equal. Work on IDNA2008 565 recognized that early versions of Unicode might also contain some 566 inconsistencies; see Section 3.3.2.3.2 below. 568 Having precomposed code points exist that don't have decompositions, 569 or having code points of that nature allocated in the future, is 570 problematic for those IDNA assumptions about character comparison. 571 It seems to call for either excluding some set of code points that 572 IDNA's rules do not now identify, development and use of a 573 normalization procedure that behaves as expected (those two options 574 may be nearly equivalent for many purposes), or deciding to accept a 575 risk that, apparently, will only increase over time. 577 It is not clear whether the reasons the IDNABIS WG did not understand 578 and allow for these cases are important except insofar as they inform 579 considerations about what to do in the future. It seemed (and still 580 seems to some people) that the Unicode Standard is very clear on the 581 matter (or at least was when IDNA2008 was being developed). In 582 addition to the normalization stability rules cited in the last part 583 of Section 3.1, the discussion in the Core Standard seems quite 584 clear. For example, "Where characters are used in different ways in 585 different languages, the relevant properties are normally defined 586 outside the Unicode Standard" in Section 2.2, subsection titled 587 "Semantics" [Unicode7] did not suggest to most readers that sometimes 588 separate code points would be allocated within a script based on 589 language considerations.
Similarly, the same section of the Standard 590 says, in a subsection titled "Unification", "The Unicode Standard 591 avoids duplicate encoding of characters by unifying them within 592 scripts across language" and does not list exceptions to that rule or 593 limit it to a single script although it goes on to list "CJK" as an 594 example. Another subsection, "Equivalent Sequences" indicates 595 "Common precomposed forms ... are included for compatibility with 596 current standards. For static precomposed forms, the standard 597 provides a mapping to an equivalent dynamically composed sequence of 598 characters". The latter appears to be precisely the "all precomposed 599 characters decompose into the relevant combining sequences if the 600 relevant base and combining characters exist in the Standard" rule 601 that IDNA needs and assumed and, again, there is no mention of 602 exceptions, language-dependent or otherwise. The summary of 603 stability policies cited in the Standard [Unicode70-Stability] does 604 not appear to shed any additional light on these issues. 606 The Standard now contains a subsection titled "Non-decomposition of 607 Overlaid Diacritics" [Unicode70-Overlay] that identifies a list of 608 diacritics that do not normally form characters that have 609 decompositions. The rule given has its own exceptions and the text 610 clearly states that there is actually no way to know whether a code 611 point has a decomposition other than consulting the Unicode Character 612 Database entry for that code point. The subsequent section notes 613 that this can be a security problem. While the issues with IDNA go 614 well beyond what is normally considered security, that comment now 615 seems clear. While that subsection is helpful in explaining the 616 problem, especially for European scripts, it does not appear in the 617 Unicode versions that were current when IDNA2008 was being developed. 619 3.3.2.
Latin Examples and Cases 621 While this set of problems was discovered because of a code point 622 added to the Arabic script in precombined form to support a 623 particular language, there are actually far more examples for, e.g., 624 Latin script than there are for Arabic script. Many of them are 625 associated with the "non-decomposition of combining diacriticals" 626 issues mentioned above, but the next subsections describe other cases 627 that are not directly bound to decomposition. 629 3.3.2.1. The font exclusion and compatibility relationships 631 Unicode contains a large collection of characters that are identified 632 as "Mathematical Symbols". A large subset of them are basic or 633 decorated Latin characters, differing from the ordinary ones only by 634 their usage and, in appearance, by font or type styling (despite the 635 general principle that font distinctions are not used as the basis 636 for assigning separate code points). Most of these have canonical 637 mappings to the base form, which eliminates them from IDNA, but 638 others do not and, because the same marks that are used as phonetic 639 diacritical markings in conventional alphabetical use have special 640 mathematical meanings, applications that permit the use of these 641 characters have their own issues with normalization and equality. 643 3.3.2.2. The phonetic notation characters and extensions 645 Another example involves various Phonetic Alphabet and Extension 646 characters, many of which, unlike the Mathematical ones, do not have 647 normalizations that would make them compare equal to the basic 648 characters with essentially identical representations. This would 649 not be a problem for IDNA if they were identified with a specialized 650 script or as symbols rather than letters, but neither is the case: 651 they are generally identified as lower case Latin Script letters even 652 when they are visually upper-case, another issue for IDNA. 654 3.3.2.3.
The stroke (solidus) ambiguity 656 Some combining characters have two or more forms. For example, in 657 the case of the character popularly known as "slash", "stroke", or 658 "solidus" (sometimes prefixed by "forward"), there are "short" and 659 "long" combining forms, U+0337 (COMBINING SHORT SOLIDUS OVERLAY) and 660 U+0338 (COMBINING LONG SOLIDUS OVERLAY). It is not clear how long a 661 short one needs to be to make it "long" or how short a long one needs 662 to be to make it "short". Perhaps for that reason, U+00F8 has no 663 decomposition and neither U+006F U+0337 nor U+006F U+0338 combine to 664 it with NFC. 666 Adding to the confusion, at least when one attempts to use Unicode 667 character names to identify places to look for problems, U+00F8 is 668 formally called LATIN SMALL LETTER O WITH STROKE but, in combining 669 character terminology, the term "stroke" refers to a horizontal bar, 670 not an angled one, as in U+0335 and U+0336 (also short and long 671 versions). However, when one overlays one of those on an "o" 672 (U+006F), one gets U+0275, LATIN SMALL LETTER BARRED O, not "...o 673 with stroke". That character, by the way, does not decompose either. 674 This does illustrate the principle that it is not feasible to rely on 675 Unicode code point names to identify confusable character sequences, 676 even ones that produce the same, more or less font-independent, 677 grapheme clusters. 679 3.3.2.3.1. Combining dots and other shapes combine... unless... 681 The discussion of "Non-decomposition of Overlaid Diacritics" 682 [Unicode70-Overlay] indirectly exhibits at least one reason why it 683 has been difficult to characterize the problem. If one combines that 684 subsection with others, one gets a set of rules that might be 685 described as: 687 1. If the precomposed character and the code points that make up the 688 combining sequence exist, then canonical composition and 689 decomposition work as expected, except... 691 2.
If the precomposed character was added to Unicode after the code 692 points that make up the combining sequence, normalization 693 stability for the combining sequences requires that NFC applied 694 to the precomposed character decomposes rather than having the 695 combining sequence compose to the new character, however... 697 3. If the combining sequence involves a diacritic or other mark that 698 actually touches the base character when composed, the 699 precomposed character does not have a decomposition, unless... 701 4. The combining diacritic involved is Cedilla (U+0327), Ogonek 702 (U+0328), or Horn (U+031B), in which case the precomposed 703 characters that contain them "regularly" (but presumably not 704 always) decompose, and... 706 5. There are further exceptions for Hamza, which does not overlay the 707 associated base character in the same way the Latin-derived 708 combining diacritics and other marks do. Those decisions to 709 decompose a precomposed character (or not) are based on language 710 or phonetic considerations, not the combining mechanism or 711 appearance, or perhaps,... 713 6. Some characters have compatibility decompositions rather than 714 canonical ones [Unicode70-CompatDecomp]. Because compatibility 715 relationships are treated differently by IDNA, PRECIS [RFC8264], 716 and, potentially, other protocols involving identifiers for 717 Internet use, the existence of a compatibility relationship may or 718 may not be helpful. Finally,... 720 7. There is no reason to believe the above list is complete. In 721 particular, if whether a precomposed character decomposes or not 722 is determined by language or phonetic distinctions or by a 723 decision that all new characters for some scripts will be 724 precomposed while new ones for others will be added (if needed) 725 as combining sequences, one may need additional rules on a per- 726 script and/or per-character basis. 728 The above list only covers the cases involving combining sequences.
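Because the Unicode Character Database entry is the only reliable indication of whether a code point decomposes, several of the rules in the list above, and the stroke case in Section 3.3.2.3, can be spot-checked mechanically. The following is a minimal sketch using Python's standard "unicodedata" module; the helper names decomposition_kind and composes_to and the choice of sample code points are assumptions made only for illustration.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character by its Unicode Character Database mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as "<compat>" before
    # the list of code points; canonical mappings have no tag.
    return "compatibility" if mapping.startswith("<") else "canonical"


def composes_to(sequence: str, target: str) -> bool:
    """Return True if NFC turns the combining sequence into target."""
    return unicodedata.normalize("NFC", sequence) == target


# Cedilla, Ogonek, and Horn examples, the two "stroke"/"barred" letters,
# and Arabic letters with Hamza Above discussed in Section 3.2.
SAMPLES = ["\u00E7", "\u0105", "\u01A1", "\u00F8", "\u0275",
           "\u0623", "\u0677", "\u0681", "\u08A1"]

for ch in SAMPLES:
    name = unicodedata.name(ch, "UNKNOWN")
    print(f"U+{ord(ch):04X} {name}: {decomposition_kind(ch)}")

# Neither solidus overlay composes with "o" to produce U+00F8.
print(composes_to("o\u0337", "\u00F8"), composes_to("o\u0338", "\u00F8"))
```

A check of this kind reports what the database says for individual code points; it does not identify which pairs of characters and sequences a user would perceive as identical.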
729 It does not cover cases such as those in Section 3.3.2.1 and 730 Section 3.3.2.2 and there may be additional groups of cases not yet 731 identified. 733 3.3.2.3.2. "Legacy" characters and new additions 735 The development of categories and rules for IDNA recognized that 736 early versions of Unicode might contain some inconsistencies if 737 evaluated using more contemporary rules about code point assignments 738 and stability. In particular, there might be some exceptions from 739 different practices in early versions of Unicode or anomalies caused 740 by copying existing single- or dual-script standards into Unicode as 741 block rather than individual character additions to the repertoire. 742 The possibility of such "legacy" exceptions was one reason why the 743 IDNA category rules include explicit provisions for exception lists 744 (even though no such code points were identified prior to 2014). 746 3.3.3. Unexpected Combining Sequences 748 Most combining characters have the script property "Inherited" or 749 "Common", i.e., are not members of any particular script and will not 750 cause rules against mixed-script labels to be triggered. 751 Normalization rules are generally structured around the base 752 character, so unexpected combinations of base characters with 753 combining ones may lead to cases where normalization might normally 754 be expected to produce a precombined character but does not do so (in 755 the most common situation because no such precombined character 756 exists). For example, the Latin script characters "a" and "a with 757 acute accent" are both coded (as U+0061 and U+00E1). If the latter 758 is coded as the combining sequence U+0061 U+0301, NFC will turn that 759 sequence into U+00E1 and everything will work as users expect. 760 However, the Cyrillic "a" character (U+0430) is notoriously similar 761 in appearance in most type styles to U+0061, and the sequence U+0430 U+0301 762 does not normalize to anything else.
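The contrast between the Latin and Cyrillic sequences can be shown in a few lines. This is a minimal sketch using Python's standard "unicodedata" module; it assumes that the Cyrillic sequence of interest is U+0430 followed by U+0301, and the variable names are illustrative only.

```python
import unicodedata

latin = "\u0061\u0301"      # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
cyrillic = "\u0430\u0301"   # CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT

nfc_latin = unicodedata.normalize("NFC", latin)
nfc_cyrillic = unicodedata.normalize("NFC", cyrillic)

# The Latin sequence composes to the single code point U+00E1.
print([f"U+{ord(c):04X}" for c in nfc_latin])

# No precomposed Cyrillic character exists, so the sequence is unchanged.
print([f"U+{ord(c):04X}" for c in nfc_cyrillic])
```

The two results are rendered almost identically in most type styles, yet one is a single code point and the other remains a two-code-point sequence in a different script.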
Because there is 763 no code point assigned for Cyrillic small letter a with acute accent 764 and unlike many of the other examples in this document, that is 765 Unicode working exactly as would be expected. Whether it is an issue 766 or not depends on the questions that are being asked and what rules 767 are being applied. 769 3.3.4. Examples and Cases from Other Scripts 771 Research into these issues has not yet turned up a comprehensive list 772 of affected scripts and code points. As discussed elsewhere in this 773 document, it is clear that Arabic and Latin Scripts are significantly 774 affected, that some Han and Kangxi radicals and ideographs are 775 affected, and that other examples do exist -- it is just not known 776 how many of those examples there are and what patterns, if any, 777 characterize them. 779 3.3.4.1. Scripts with precomposed preferences and ones with combining 780 preferences 782 While the authors have been unable to find an explanation for the 783 differentiation in the Unicode Standard, we have been told that there 784 are differences among scripts as to whether the action preference is 785 to add new combining sequences only (and resist adding precomposed 786 characters) as suggested in Section 3.3.2.3.1 or to add precomposed 787 characters, often ones that do not have decompositions. If those 788 differences in preference do exist, it is probably important to have 789 them documented so that they can be reflected in IDNA review 790 procedures and elsewhere. It will also require IETF discussion of 791 whether combining sequences should be deprecated when the 792 corresponding precomposed characters are added or to disallow 793 combining sequences entirely for those scripts (as has been 794 implicitly suggested for Arabic language use [RFC5564]). 796 [[CREF2: The above isn't quite right and probably needs additional 797 discussion and text.]] 799 3.3.4.2. The Han and Kangxi Cases 801 [[CREF3: .. to be supplied .. ]] 803 3.4.
Confusion and the Casual User 805 To the extent to which predictability for relatively casual users is 806 a desired and important feature of relevant application or 807 application support protocols, it is probably worth observing that 808 the complex of rules and cases suggested or implied above is almost 809 certainly too involved for the typical such user to develop a good 810 intuitive understanding of how things behave and what relationships 811 exist. Conversely, the nature of writing systems for natural 812 languages, especially those that have evolved and diverged over 813 centuries, implies that no set of rules about allowable characters 814 will guarantee complete safety (however that is defined). 816 4. Implementation options and issues: Unicode properties, exceptions, 817 and the nature of stability 819 4.1. Unicode Stability compared to IETF (and ICANN) Stability 821 The various stability rules in Unicode [Unicode70-Stability] all 822 appear to be based on the model that once a value is assigned, it can 823 never be changed. That is probably appropriate for a character 824 coding system with multiple uses and applications. It is probably 825 the only option when normative relationships are expressed in tables 826 of values rather than by rules. One consequence of such a model is 827 that it is difficult or impossible to fix mistakes (for some 828 stability rules, the Unicode Standard does provide for exceptions) 829 and even harder to make adjustments that would normally be dictated 830 by evolution. 832 "No changes" provides a very strong and predictable type of 833 stability. There are many reasons to take that path. As in some of 834 the cases that motivated this document, the difficulty is that simply 835 adding new code points (in Unicode) or features (in a protocol or 836 application) may be destabilizing.
One then has complete stability 837 for systems that never use or allow the new code points or features, 838 but rough edges for newer systems that see the discrepancies. 839 IDNA2003 (inadvertently) took that approach by freezing 840 on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever 841 been allowed, we would have had complete stability even as Unicode 842 libraries changed. Unicode has been quite ingenious about working 843 around those difficulties with such provisions as having code points 844 for newly-added precomposed characters decompose rather than altering 845 the normalization for the combining sequences. Other cases, such as 846 newly-added precomposed characters that do not decompose for, e.g., 847 language or phonetic reasons, are more problematic. 849 The IETF (and ICANN and standards development bodies such as ISO and 850 ISO/IEC JTC1) have generally adopted a different type of stability 851 model, one which considers experience in use and the ill effects of 852 not making changes as well as the disruptive effects of doing so. In 853 the IETF model, if an earlier decision is causing sufficient harm and 854 there is consensus in the communities that are most affected that a 855 change is desirable enough to make transition costs acceptable, then 856 the change is made. 858 The difference and its implications are perhaps best illustrated by a 859 disagreement when IDNA2008 was being approved. IDNA2003 had 860 effectively prevented some characters, notably (measured by intensity 861 of the protests) the Sharp S character (U+00DF) from being used in 862 DNS labels by mapping them to other characters before conversion to 863 ACE form. It had also prohibited some other code points, notably ZWJ 864 (U+200D) and ZWNJ (U+200C), by discarding them.
In both cases, there 865 were strong voices from the relevant language communities, supported 866 by the registry communities, that the characters were important 867 enough that it was more desirable to undergo the short-term pain of a 868 transition and some uncertainty than to continue to exclude those 869 characters, and the IDNA2008 rules and repertoire are consistent with 870 that preference. The Unicode Consortium apparently believed that 871 stability --elimination of any possibility of label invalidation or 872 different interpretations of the same string-- was more important 873 than those writing system requirements and community preferences. 874 That view was expressed through what was effectively a fork in (or 875 attempt to nullify) the IETF Standard [UTS46], a result that has 876 probably been worse for the overall Internet than either of the 877 possible decision choices. 879 4.2. New Unicode Properties 881 One suggestion about the way out of these problems would be to create 882 one or more new Unicode properties, maintained along with the rest of 883 Unicode, and then incorporated into new or modified rules or 884 categories in IDNA. Given the analysis in this document, it appears 885 that that property (or properties) would need to provide: 887 1. Identification of combining characters that, when used in 888 combining sequences, do not produce decomposable characters. 889 [[CREF4: Wording on the above is not quite right but, for the 890 present, maybe the intent is clear.]] 892 2. Identification of precomposed characters that might reasonably be 893 expected to decompose, but that do not. 895 3. Identification of character forms that are distinct only because 896 of language or phonetic distinctions within a script. 898 4. Identification of scripts for which precomposed forms are 899 strongly preferred and combining sequences should either be 900 viewed as temporary mechanisms until precomposed characters are 901 assigned or banned entirely. 903 5.
Identification of code points that represent symbols for 904 specific, non-language, purposes even if identified as letters or 905 numerals by their General Category. This would include all 906 characters given separate code points because of specialized 907 "mathematical" and "phonetic" characters (see Section 3.3.2.2 and 908 Section 3.3.2.1), but there are probably additional cases. 910 Some of these properties (or characteristics or values of a single 911 property) would be suitable for disallowing characters, code points, 912 or contextual sequences that otherwise might be allowed by IDNA. 913 Others would be more suitable for making equality comparisons come 914 out as needed by IDNA, particularly to eliminate distinctions based 915 on language context. 917 While it would appear that appropriate rules and categories could be 918 developed for IDNA (and, presumably, for PRECIS, etc.) if the problem 919 areas are those identified in this document, it is not yet known 920 whether the list is complete (and, hence, whether additional 921 properties or information would be needed). 923 Even with such properties, IDNA would still almost certainly need 924 exception lists. In addition, it is likely that stability rules for 925 those properties would need to reflect IETF norms with arrangements 926 for bringing the IETF and other communities into the discussion when 927 tradeoffs are reviewed. 929 4.3. The need for exception lists 931 [[CREF5: Note in draft: this section is a partial placeholder and may 932 need more elaboration.]] 933 Issues with exception lists and the requirements for them are 934 discussed in Section 2 above and in RFC 5894 [RFC5894]. 936 5. Proposed/Alternative Changes to RFC 5892 for the issues first 937 exposed by new code point U+08A1 939 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and 940 the first paragraph of each Subsection below for the status of the 941 Subsections that follow.
Each one, in combination with the material 942 in Section 3 above, also provides information about the reasons why 943 that particular strategy might or might not be appropriate. 945 When the term "Category" followed by an upper-case letter appears 946 below, it is a reference to a rule in RFC 5892. 948 5.1. Disallow This New Code Point 950 This option is almost certainly too Arabic-specific and does not 951 solve, or even address, the underlying problem. It also does not 952 inherently generalize to non-decomposing precomposed code points that 953 might be added in the future (whether to Arabic or other scripts) 954 even though one could add more code points to Category F in the same 955 way. 957 If chosen by the community, this subsection would update the portion 958 of the IDNA2008 specification that identifies rules for what 959 characters are permitted [RFC5892] to disallow that code point. 961 With the publication of this document, Section 2.6 ("Exceptions (F)") 962 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in 963 Category F so that the rule itself reads: 965 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 966 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 967 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 968 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, 969 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 970 303B, 30FB} 972 and then add to the subtable designated 973 "DISALLOWED -- Would otherwise have been PVALID" 974 after the line that begins "07FA", the additional line: 976 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 978 This has the effect of making the cited code point DISALLOWED 979 independent of application of the rest of the IDNA rule set to the 980 current version of Unicode.
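For implementers, the effect of the updated rule can be expressed as a set-membership test. The following is a minimal sketch, not a complete implementation of the RFC 5892 algorithm: the set is copied from the rule as it would read after this update, only the single exception value added by this subsection is encoded, and the names CATEGORY_F, EXCEPTION_VALUES, in_category_f, and label_uses_disallowed_exception are assumptions made for illustration.

```python
# Code points in the Category F rule as updated by this subsection.
CATEGORY_F = frozenset(int(cp, 16) for cp in (
    "00B7 00DF 0375 03C2 05F3 05F4 0640 0660 0661 0662 0663 0664 0665 "
    "0666 0667 0668 0669 06F0 06F1 06F2 06F3 06F4 06F5 06F6 06F7 06F8 "
    "06F9 06FD 06FE 07FA 08A1 0F0B 3007 302E 302F 3031 3032 3033 3034 "
    "3035 303B 30FB"
).split())

# Only the line added here is shown; the values for the other
# exceptions are those listed in RFC 5892, Section 2.6.
EXCEPTION_VALUES = {0x08A1: "DISALLOWED"}


def in_category_f(cp: int) -> bool:
    """Return True if the code point is matched by the Category F rule."""
    return cp in CATEGORY_F


def label_uses_disallowed_exception(label: str) -> bool:
    """Return True if any code point in the label is a DISALLOWED exception."""
    return any(EXCEPTION_VALUES.get(ord(ch)) == "DISALLOWED" for ch in label)


print(in_category_f(0x08A1))
print(label_uses_disallowed_exception("\u08A1"))
print(label_uses_disallowed_exception("\u0628\u0654"))
```

Under this sketch a label containing U+08A1 is flagged while the same visual string written as U+0628 U+0654 is not.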
Those wishing to create domain name 981 labels containing Beh with Hamza Above may continue to use the 982 sequence 984 U+0628, ARABIC LETTER BEH 985 followed by 987 U+0654, ARABIC HAMZA ABOVE 989 which was valid for IDNA purposes in Unicode 5.0 and earlier and 990 which continues to be valid. 992 In principle, much the same thing could be accomplished by using the 993 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892 994 Section 5.3). However, that category is described as applying only 995 when "property values in versions of Unicode after 5.2 have changed 996 in such a way that the derived property value would no longer be 997 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in 998 Unicode 7.0.0 and no property values of code points in prior versions 999 have changed, category G does not apply. If that section of RFC 5892 1000 were to be replaced in the future, perhaps consideration should be 1001 given to adding Normalization Stability and other issues to that 1002 description but, at present, it is not relevant. 1004 5.2. Disallow This New Code Point and All Future Precomposed Additions 1005 that Do Not Decompose 1007 At least in principle, the approach suggested above (Section 5.1) 1008 could be expanded to disallow all future allocations of non- 1009 decomposing precomposed characters. This would probably require 1010 either a new Unicode property to identify such characters and/or more 1011 emphasis on the manual, individual code point, checking of the new 1012 Unicode version review process (i.e., not just application of the 1013 existing rules and algorithm). It might require either a new rule in 1014 IDNA or a modification to the structure of Category F to make 1015 additions less tedious. It would do nothing for different ways to 1016 form identical characters within the same script that were not 1017 associated with decomposition and so would have to be used in 1018 conjunction with other approaches.
Finally, for scripts (such as 1019 Arabic) where there is a very strong preference to avoid combining 1020 sequences, this approach would exclude exactly the wrong set of 1021 characters. 1023 5.3. Disallow the combining sequences for these characters 1025 As in the approach discussed in Section 5.1, this approach is too 1026 Arabic-specific to address the more general problem. However, it 1027 illustrates a single-script approach and a possible mechanism for 1028 excluding combining sequences whose handling is connected to language 1029 information (information that, as discussed above, is not relevant to 1030 the DNS). 1032 If chosen by the community, this subsection would update the portion 1033 of the IDNA2008 specification that identifies contextual rules 1034 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction 1035 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that 1036 the choice of this option is consistent with the general preference 1037 for precomposed characters discussed above but would ban some labels 1038 that are valid today and that might, in principle, be in use. 1040 The required prohibition could be imposed by creating a new 1041 contextual rule in RFC 5892 to constrain combining sequences 1042 containing Hamza Above. 1044 As the Unicode Standard points out at some length [Unicode70-Arabic], 1045 Hamza is a problematic abstract character and the "Hamza Above" 1046 construction even more so. IDNA has historically associated 1047 characters whose use is reasonable in some contexts but not others 1048 with the special derived property "CONTEXTO" and then specified 1049 specific, context-dependent, rules about where they may be used. 1050 Because Hamza Above is problematic (and spawns edge cases, as 1051 discussed in the Unicode Standard section cited above), it was 1052 suggested that a contextual rule might be appropriate. 
There are at 1053 least two reasons why a contextual rule would not be suitable for the 1054 present situation. 1056 1. As discussed above, the present situation is a normalization 1057 stability and predictability problem, not a contextual one. Had 1058 the same issues arisen with a newly-added precomposed character 1059 that could previously be constructed from non-problematic base 1060 and combining characters, it would be even more clearly a 1061 normalization issue and, following the principles discussed there 1062 and particularly in UAX 15 [UAX15-Exclusion], might not have been 1063 assigned at all. 1065 2. The contextual rule sets are designed around restricting the use 1066 of code points to a particular script or adjacent to particular 1067 characters within that script. Neither of these cases applies to 1068 the newly-added character even if one could imagine rules for the 1069 use of Hamza Above (U+0654) that would reflect the considerations 1070 of Chapter 8 of Unicode 6.2. Even had the latter been desired, 1071 it would be somewhat late now -- Hamza Above has been present as 1072 a combining character (U+0654) in many versions of Unicode. 1073 While that section of the Unicode Standard describes the issues, 1074 it does not provide actionable guidance about what to do about it 1075 for cases going forward or when visual identity is important. 1077 5.4. Use Combining Classes to Develop Additional Contextual Rules 1079 This option may not be of any practical use, but Unicode supports a 1080 property called "Combining_Class". That property has been used in 1081 IDNA only to construct a contextual rule for Zero-Width Non-Joiner 1082 [RFC5892, Appendix A.1] but speculation has arisen during discussions 1083 of work on Arabic combining characters and rendering [UTR53] as to 1084 whether Combining Classes could be used to build additional 1085 contextual rules that would restrict problematic cases.
Unless such rules were applied only to new code points, they would also not be backward compatible.

The question of whether Combining Classes could be used to reduce the number of problematic labels is at least worth examination.

5.5. Disallow all Combining Characters for Specific Scripts

[[CREF6: This subsection needs to be turned into prose, but the following bullet points are probably sufficient to identify the issues.]]

o Might work for Arabic and other "precomposed preference" scripts if those can be identified in an orderly and stable way (see Section 3.3.4.1; recommended by the Arabic language community for IDNs [RFC5564]).

o Unworkable for Latin because many characters that do not decompose are, at least in part, historical accidents resulting from combining prior national standards (this problem may exist for other scripts as well).

o No effect at all on special-use representations of identical characters within a script (see Section 3.3.2.1 and Section 3.3.2.2).

o Not backward compatible.

5.6. Do Nothing Other Than Warn

A recommendation from UTC and others has been to simply warn registries, at all levels of the tree, to be careful with this set of characters. Doing that well would probably require making language distinctions within zones, which would violate the important IDNA principles that labels are not necessarily "words", do not carry language information, and may, at the protocol level, even deliberately mix languages and scripts. It is also problematic because the relevant set of characters is not easily defined in a precise way.
This suggestion is problematic because the DNS and IDNA cannot make or enforce language distinctions, but it would avoid having the IETF either invalidate label strings that are potentially now in use or create inconsistencies among the characters that combine with selected base characters but that also have precomposed forms that do not have decompositions. The potential would still exist for registries to respect the warning and deprecate such labels if they existed.

More generally, while there are already requirements in IDNA for registries to be knowledgeable and responsible about the labels they register (a separate document discusses that requirement [Klensin-rfc5891bis]), experience indicates that those requirements are often ignored. At least as important, warning registries about what should or should not be registered, and even calling out specific code points as dangerous and in need of extra attention [Freytag-dangerous], does nothing to address the many cases in which lookup-time checking for IDNA conformance and for deliberately misleading label constructions is important.

5.7. Normalization Form IETF (NFI)

The most radical possibility for the comparison issue would be to decide that none of the Unicode Normalization Forms specified in UAX 15 [UAX15] is adequate for use with the DNS because, contrary to their apparent descriptions, normalization tables are actually determined using language information. However, use of language information is unacceptable for IDNA for reasons described elsewhere in this document. The remedy would be to define an IETF-specific (or DNS-specific) normalization form (sometimes called "NFI" in discussions), building on NFC but adhering strictly to the rule that normalization causes two different forms of the same character (glyph image) within the same script to be treated as equal.
In practice, such a form could be implemented for IDNA purposes as an additional rule within RFC 5892 (and its successors) that constituted an exception list for the NFC tables. For this set of characters, the special IETF normalization form would be equivalent to the exclusion discussed in Section 5.3 above.

An Internet-identifier-specific normalization form, especially if specified somewhat separately from the IDNA core, would have a marginal advantage over the other strategies in this section (or in combination with some of them), even though most of the end result and much of the implementation would be the same in practice. While the design of IDNA requires that strings be normalized as part of the process of determining label validity (and hence before either storage of values in the DNS or name resolution), there is an ongoing debate about whether normalization should be performed before storing a string or putting it on the wire or only when the string is actually compared or otherwise used.

If a normalization procedure with the right properties for the IETF were defined, that argument could be bypassed and the best decisions made for different circumstances. The separation would also allow better comparison of strings that lack language context in application environments in which the additional processing and character classifications of IDNA and/or PRECIS were not applicable. Having such a normalization procedure defined outside IDNA would also minimize changes to IDNA itself, which is probably an advantage.
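
As a purely illustrative sketch of the "exception list on top of NFC" idea described above, and not a proposal or an excerpt from any specification, the following Python fragment applies NFC and then folds one exception sequence into its precomposed form. The table name NFI_EXCEPTIONS, the functions nfi() and nfi_equal(), and the choice to fold BEH (U+0628) followed by Hamza Above (U+0654) into U+08A1 are assumptions made only for this example; an exception rule that simply disallowed the sequence would be equally consistent with the text.

```python
import unicodedata

# Hypothetical exception table for illustration only: sequences that NFC
# leaves untouched but that an "NFI" overlay would treat as equal to a
# precomposed code point. Only the BEH + Hamza Above case is listed.
NFI_EXCEPTIONS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE -> BEH WITH HAMZA ABOVE
}


def nfi(label: str) -> str:
    """Apply NFC, then fold each listed exception sequence (a sketch)."""
    result = unicodedata.normalize("NFC", label)
    for sequence, precomposed in NFI_EXCEPTIONS.items():
        result = result.replace(sequence, precomposed)
    return result


def nfi_equal(a: str, b: str) -> bool:
    """Compare two labels after the hypothetical NFI folding."""
    return nfi(a) == nfi(b)
```

Under these assumptions, nfi_equal("\u0628\u0654", "\u08A1") is true even though the two strings remain different under NFC alone. Simple substring replacement does not handle other combining marks that may sit between the base character and Hamza Above after canonical reordering; a real definition would have to address that case.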

If the new normalization form were, in practice, simply an overlay on NFC with modifications dictated by exception and/or property lists, keeping its definition separate from IDNA would also avoid interweaving those exceptions and property lists with the rules and categories of IDNA itself and so avoid some unnecessary complexity.

6. Editorial clarification to RFC 5892

Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a clarification to Appendix A and Section A.1 of RFC 5892. This section of this document updates the RFC to apply that clarification.

1. In Appendix A, add a new paragraph after the paragraph that begins "The code point...". The new paragraph should read:

   "For the rule to be evaluated to True for the label, it MUST be evaluated separately for every occurrence of the Code point in the label; each of those evaluations must result in True."

2. In Appendix A, Section A.1, replace the "Rule Set" by

   Rule Set:
      False;
      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
      If cp .eq. \u200C And
         RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
            (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

7. Acknowledgements

The Unicode 7.0.0 changes were extensively discussed within the IAB's Internationalization Program. The authors are grateful for the discussions and feedback there, especially from Andrew Sullivan and David Thaler. Additional information was requested and received from Mark Davis and Ken Whistler. While they probably do not agree with the necessity of excluding this code point or taking even more drastic action, because their responsibility is to look at the Unicode Consortium requirements for stability, the decision would not have been possible without their input.
Thanks to Bill McQuillan and Ted Hardie for reading versions of the document carefully enough to identify and report some confusing typographical errors. Several experts and reviewers who prefer to remain anonymous also provided helpful input and comments on preliminary versions of this document.

8. IANA Considerations

When the IANA registry and tables are updated to reflect Unicode 7.0.0, changes should be made according to the decisions the IETF makes about Section 5.

9. Security Considerations

From at least one point of view, this document is entirely a discussion of a security issue or set of such issues. The "similar-looking characters" issue has been a concern since the earliest days of IDNs [HomographAttack] and has driven assorted "character confusion" projects [ICANN-VIP]. The issue here goes further: if a user types a string on one device and obtains a result that does not compare equal to the result of typing it on a different device (with both devices behaving correctly and both keyboards appearing to be the same and for the same script), then all security mechanisms that depend on the underlying identifiers, including the practical applications of DNS response integrity checks via DNSSEC [RFC4033] and DNS-embedded public key mechanisms [RFC6698], are at risk if different parties, at least one of them malicious, obtain or register some of the identical-appearing and identically-typed strings and get them into appropriate zones.

Mechanisms that depend on trusting registration systems (e.g., registries and registrars in the DNS IDN case, see Section 5.6 above) are likely to be of only limited utility because fully-qualified domains that may be perfectly reasonable at the first level or two of the DNS may have differences of this type deep in the tree, at levels where name management, and often accountability, are weak.
Similar issues obviously apply when names are user-selected or unmanaged.

When the issue is not a deliberate attack but simple accidental confusion among similar strings, most of our strategies depend on the acceptability of false negatives on matching if there is low risk of false positives (see, for example, the discussion of false negatives in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]). Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and the PRECIS effort [RFC8264]. However, because the cases covered here are connected not just to what the user sees but also to what is typed and where, there is an increased risk of false positives (accidental as well as deliberate).

[[CREF7: Note in Draft: The paragraph that follows was written for a much earlier version of this document. It is obsolete, but is being retained as a placeholder for future developments.]]

This specification excludes a code point for which the Unicode-specified normalization behavior could result in two ways to form a visually-identical character within the same script not comparing equal. That behavior could create a dream case for someone intending to confuse the user by use of a domain name that looked identical to another one, was entirely in the same script, but was still considered different.

Internet security in areas that involve internationalized identifiers that might contain the relevant characters is therefore significantly dependent on some effective resolution for the issues identified in this document, not just hand waving, devout wishes, or appointment of study committees about them.

10. References

10.1. Normative References

[RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008.

[RFC5890]  Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", RFC 5890, DOI 10.17487/RFC5890, August 2010.

[RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", RFC 5892, DOI 10.17487/RFC5892, August 2010.

[RFC5892Erratum]  "RFC5892, "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", August 2010, Errata ID: 3312", Errata ID 3312, August 2012.

[RFC5894]  Klensin, J., "Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010.

[RFC6943]  Thaler, D., Ed., "Issues in Identifier Comparison for Security Purposes", RFC 6943, DOI 10.17487/RFC6943, May 2013.

[RFC8264]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols", RFC 8264, DOI 10.17487/RFC8264, October 2017.

[UAX15]  Davis, M., Ed., "Unicode Standard Annex #15: Unicode Normalization Forms", June 2014.

[UAX15-Exclusion]  "Unicode Standard Annex #15, op. cit., Section 5".

[UAX15-Versioning]  "Unicode Standard Annex #15, op. cit., Section 3".

[Unicode5]  The Unicode Consortium, "The Unicode Standard, Version 5.0", ISBN 0-321-48091-0, 2007.

   Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. This printed reference has now been updated online to reflect additional code points. For code points, the reference at the time RFCs 5890-5894 were published is to Unicode 5.2.

[Unicode62]  The Unicode Consortium, "The Unicode Standard, Version 6.2.0", ISBN 978-1-936213-07-8, 2012.

   Preferred citation: The Unicode Consortium. The Unicode Standard, Version 6.2.0, (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8)

[Unicode7]  The Unicode Consortium, "The Unicode Standard, Version 7.0.0", ISBN 978-1-936213-09-2, 2014.

   Preferred citation: The Unicode Consortium. The Unicode Standard, Version 7.0.0, (Mountain View, CA: The Unicode Consortium, 2014. ISBN 978-1-936213-09-2)

[Unicode70-Arabic]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 9.2: Arabic", Chapter 9, 2014.

   Subsection titled "Encoding Principles", paragraph numbered 4, starting on page 362.

[Unicode70-CompatDecomp]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 2.3: Compatibility Characters", Chapter 2, 2014.

   Subsection titled "Compatibility Decomposable Characters" starting on page 26.

[Unicode70-Design]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 2.2: Unicode Design Principles", Chapter 2, 2014.

[Unicode70-Hamza]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 9.2: Arabic", Chapter 9, 2014.

   Subsection titled "Combining Hamza Above" starting on page 378.

[Unicode70-Overlay]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 2.2: Unicode Design Principles", Chapter 2, 2014.

   Subsection titled "Non-decomposition of Overlaid Diacritics" starting on page 64.

[Unicode70-Stability]  "The Unicode Standard, Version 7.0.0, op. cit., Chapter 2.2: Unicode Design Principles", Chapter 2, 2014.

   Subsection titled "Stability" starting on page 23 and containing a link to http://www.unicode.org/policies/stability_policy.html.

[UTS46]  Davis, M. and M. Suignard, "Unicode Technical Standard #46: Unicode IDNA Compatibility Processing", Version 7.0.0, June 2014.

10.2. Informative References

[Dalby]  Dalby, A., "Dictionary of Languages: The definitive reference to more than 400 languages", Columbia University Press, 2004.

   pages 206-207

[Daniels]  Daniels, P. and W. Bright, "The World's Writing Systems", Oxford University Press, 1986.

[Freytag-dangerous]  Freytag, A., Klensin, J., and A. Sullivan, "Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers", June 2017.

[HomographAttack]  Gabrilovich, E. and A. Gontmakher, "The Homograph Attack", Communications of the ACM 45(2):128, February 2002.

[ICANN-VIP]  ICANN, "The IDN Variant Issues Project: A Study of Issues Related to the Management of IDN Variant TLDs (Integrated Issues Report)", February 2012.

[Klensin-rfc5891bis]  Klensin, J., "Internationalized Domain Names in Applications (IDNA): Registry Restrictions and Recommendations", September 2017.

[Omniglot-Fula]  Ager, S., "Omniglot: Fula (Fulfulde, Pulaar, Pular'Fulaare)".

   Captured 2015-01-07

[RFC0020]  Cerf, V., "ASCII format for network interchange", STD 80, RFC 20, DOI 10.17487/RFC0020, October 1969.

[RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, DOI 10.17487/RFC3490, March 2003.

[RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, DOI 10.17487/RFC3492, March 2003.

[RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, DOI 10.17487/RFC3986, January 2005.

[RFC4033]  Arends, R., Austein, R., Larson, M., Massey, D., and S. Rose, "DNS Security Introduction and Requirements", RFC 4033, DOI 10.17487/RFC4033, March 2005.

[RFC5564]  El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman, "Linguistic Guidelines for the Use of the Arabic Language in Internet Domains", RFC 5564, DOI 10.17487/RFC5564, February 2010.

[RFC6452]  Faltstrom, P., Ed. and P. Hoffman, Ed., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) - Unicode 6.0", RFC 6452, DOI 10.17487/RFC6452, November 2011.

[RFC6698]  Hoffman, P. and J. Schlyter, "The DNS-Based Authentication of Named Entities (DANE) Transport Layer Security (TLS) Protocol: TLSA", RFC 6698, DOI 10.17487/RFC6698, August 2012.

[Unicode32]  The Unicode Consortium, "The Unicode Standard, Version 3.2.0".

   The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/) and by the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28/).

[UTR53]  Unicode Consortium, "Proposed Draft: Unicode Technical Report #53: Unicode Arabic Mark Ordering Algorithm", August 2017.

   Note: this is a Proposed Draft, out for public review when this version of the current I-D is posted, and should not be considered either an approved/final document or a stable reference.

Appendix A. Change Log

RFC Editor: Please remove this appendix before publication.

A.1. Changes from version -00 (2014-07-21) to -01

o Version 01 of this document is an extensive rewrite and reorganization, reflecting discussions with UTC members and adding three more options for discussion to the original proposal to simply disallow the new code point.

A.2. Changes from version -01 (2014-12-07) to -02

Corrected a typographical error in which Hamza Above was listed with the wrong code point.

A.3. Changes from version -02 (2014-12-07) to -03

Corrected a typographical error in the Abstract in which RFC 5892 was incorrectly shown as 5982.

A.4. Changes from version -03 (2015-01-06) to -04

o Explicitly identified the applicability of U+08A1 with Fula and added references that discuss that language and how it is written.

o Updated several Unicode 6.2 references to point to Unicode 7.0 since the latter is now available in stable form (it was not when work on this I-D started).

o Extensively revised to discuss the non-Arabic cases, non-decomposing diacritics, other types of characters that don't compare equal after normalization, and the more general problem and approaches.

A.5. Changes from version -04 (2015-03-11) to -05

o Modified a few citation labels to make them more obvious.

o Restructured Section 1 and added additional terminology comments.

o Added discussion about non-decomposable character cases, including the "slash" example, and associated references for which -04 contained only placeholders.

o The examples and discussion of Latin script issues have been expanded considerably. It is unfortunate that many readers in the IETF community apparently cannot understand examples well enough to believe a problem is significant unless there is a discussion of Latin script examples, but, at least for this working draft, that is the way it is.

o Rewrote the discussion of several of the alternatives and added the discussion of combining classes.

o Rewrote and extended the discussion of the "warn only" alternative.

o Several other sections modified to improve technical or editorial clarity.

o Note that, while some references have been updated, others have not. In particular, Unicode references are still tied to versions 6 or 7. In some cases, those non-historical references are and will remain appropriate; others will best be replaced with information about current versions of documents.

Authors' Addresses

John C Klensin
1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140
USA

Phone: +1 617 245 1457
Email: john-ietf@jck.com

Patrik Faltstrom
Netnod
Franzengatan 5
Stockholm 112 51
Sweden

Phone: +46 70 6059051
Email: paf@netnod.se