idnits 2.17.1

draft-klensin-idna-5892upd-unicode70-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == It seems as if not all pages are separated by form feeds - found 28 form
     feeds but 744 pages

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 1033:
     '...ated to True for the label, it MUST be...'

  -- The draft header indicates that this document updates RFC5892, but the
     abstract doesn't seem to directly say this.  It does mention RFC5892
     though, so this could be OK.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5892, updated by this document, for
     RFC5378 checks: 2008-04-26)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 10, 2015) is 3335 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'PRECIS-Framework'

  -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also
     mentioned in 'RFC5892'.

  ** Downref: Normative reference to an Informational RFC: RFC 5894

  ** Downref: Normative reference to an Informational RFC: RFC 6943

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UAX15-Exclusion'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UAX15-Versioning'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS46'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicod70-CompatDecomp'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicod70-Overlay'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Arabic'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Design'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Hamza'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Stability'

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 19 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                         J. Klensin
Internet-Draft
Updates: 5892, 5894 (if approved)                           P. Faltstrom
Intended status: Standards Track                                  Netnod
Expires: September 11, 2015                               March 10, 2015


                     IDNA Update for Unicode 7.0.0
              draft-klensin-idna-5892upd-unicode70-04.txt

Abstract

   The current version of the IDNA specifications anticipated that each
   new version of Unicode would be reviewed to verify that no changes
   had been introduced that required adjustments to the set of rules
   and, in particular, whether new exceptions or backward compatibility
   adjustments were needed.  The review for Unicode 7.0.0 first
   identified a potentially problematic new code point and then a much
   more general and difficult issue with Unicode normalization.  This
   specification discusses those issues and proposes updates to IDNA
   and, potentially, the way the IETF handles comparison of identifiers
   more generally, especially when there is no associated language or
   language identification.  It also applies an editorial clarification
   to RFC 5892 that was the subject of an earlier erratum and updates
   RFC 5894 to point to the issues involved.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 11, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Document Aspirations
   3.  Problem Description
     3.1.  IDNA assumptions about Unicode normalization
     3.2.  The discovery and the Arabic script cases
       3.2.1.  New code point U+08A1, decomposition, and language
               dependency
       3.2.2.  Other examples of the same behavior within the Arabic
               Script
       3.2.3.  Hamza and Combining Sequences
     3.3.  Precomposed characters without decompositions more
           generally
       3.3.1.  Description of the general problem
       3.3.2.  Latin Examples and Cases
       3.3.3.  Examples and Cases from Other Scripts
       3.3.4.  Scripts with precomposed preferences and ones with
               combining preferences
     3.4.  Confusion and the casual user
   4.  Implementation options and issues: Unicode properties,
       exceptions, and the nature of stability
     4.1.  Unicode Stability compared to IETF (and ICANN) Stability
     4.2.  New Unicode Properties
     4.3.  The need for exception lists
   5.  Proposed/Alternative Changes to RFC 5892 for the issues
       first exposed by new code point U+08A1
     5.1.  Disallow This New Code Point
     5.2.  Disallow This New Code Point and All Future Precomposed
           Additions that do not decompose
     5.3.  Disallow the combining sequences for these characters
     5.4.  Disallow all Combining Characters for Specific Scripts
     5.5.  Do Nothing Other Than Warn
     5.6.  Normalization Form IETF (NFI)
   6.  Editorial clarification to RFC 5892
   7.  Acknowledgements
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Change Log
     A.1.  Changes from version -00 to -01
     A.2.  Changes from version -01 to -02
     A.3.  Changes from version -02 to -03
     A.4.  Changes from version -03 to -04
   Authors' Addresses

1.  Introduction

      Note in/about -04 Draft: This version of the document contains a
      very large amount of new material as compared to the -03 version.
      The new material reflects an evolution of community understanding
      in the last two months from an assumption that the problem
      involved only a few code points and one combining character in a
      single script (Hamza Above and Arabic) to an understanding that it
      is quite pervasive and may represent fundamental misunderstandings
      or omissions from IDNA2008 (and, by extension, the basics of
      PRECIS [PRECIS-Framework]) that must be corrected if those
      protocols are going to be used in a way that supports the
      predictability (as seen by the end user) and security of Internet
      internationalized identifiers.

      This version is still necessarily incomplete: not only is our
      understanding probably still not comprehensive, but there are a
      number of placeholders for text and references.  Nonetheless, the
      document in its current form should be useful as both the
      beginning of a comprehensive overview of the issues and a source
      of references to other relevant materials.

      This draft could almost certainly be organized better to improve
      its readability: specific suggestions would be welcome.

   The current version of the IDNA specifications, known as "IDNA2008"
   [RFC5890], anticipated that each new version of Unicode would be
   reviewed to verify that no changes had been introduced that required
   adjustments to IDNA's rules and, in particular, whether new
   exceptions or backward compatibility adjustments were needed.  When
   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
   comparing it to prior versions including the text in Unicode 6.2
   [Unicode62], it identified a problematic new code point (U+08A1,
   ARABIC LETTER BEH WITH HAMZA ABOVE).  The code point was added for
   use with the Fula (also known as Fulfulde, Pulaar, and Pular'Fulaare)
   language, a language that, apparently, is most often written in Latin
   characters today [Omniglot-Fula] [Dalby] [Daniels].

   The specific problem is discussed in detail in Section 3.  In very
   broad terms, IDNA (and other IETF work) assumes that, if one can
   represent "the same character" either as a combining sequence or as a
   single code point, strings that are identical except for those
   alternate forms will compare equal after normalization.  Part of the
   difficulty that has characterized this discussion is that "the same"
   differs depending on the criteria that are chosen.

   The behavior of the newly-added code point, while non-optimal for
   IDNA, follows that of a few code points that predate Unicode 7.x and
   even the IDNA2008 specifications and Unicode 6.0.  Those existing
   code points, which may not be easy to accurately characterize as a
   group, make two questions exceedingly problematic: what, if anything,
   to do about this new code point and, perhaps separately, what to do
   about existing sets of code points with the same behavior.  Different
   reasonable criteria yield different decisions, specifically:

   o  To disallow it (and future, but not existing, characters with
      similar characteristics) as an IDNA exception case creates
      inconsistencies with how those earlier code points were handled.

   o  To disallow it and the similar code points as well would
      necessitate invalidating some potential labels that would have
      been valid under IDNA2008 until this time.  Depending on how the
      collection of similar code points is characterized, a few of them
      are almost certainly used in reasonable labels.

   o  To permit the new code point to be treated as PVALID creates a
      situation in which it is possible, within the same script, to
      compose the same character symbol (glyph) in two different ways
      that do not compare equal even after normalization.  That
      condition would then apply to it and the earlier code points with
      the same behavior.
      That situation contradicts a fundamental
      assumption of IDNA that is discussed in more detail below.

   NOTE IN DRAFT:

      This working draft discusses six alternatives, including an idea
      (an IETF-specific normalization form) that seemed too drastic to
      be considered a few months ago.  However, it not only would have
      been appropriate to discuss when the IDNA2008 specifications were
      being developed but is appearing more attractive now.  The authors
      suggest that the community discuss the relevant tradeoffs and make
      a decision and that the document then be revised to reflect that
      decision, with the other alternatives discussed as options not
      chosen.  Because there is no ideal choice, the discussion of the
      issues in Section 3 is probably as important as, or more important
      than, the particular choice of how to handle this code point.  In
      addition to providing information for this document, that section
      should be considered as an updating addendum to RFC 5894 [RFC5894]
      and should be incorporated into any future revision of that
      document.

      As the result of this version of the document containing several
      alternate proposals, some of the text is also a little bit
      redundant.  That will be corrected in future versions.

   As anticipated when IDNA2008, and RFC 5892 in particular, were
   written, exceptions and explicit updates are likely to be needed only
   if there is disagreement between the Unicode Consortium's view about
   what is best for the Standard and the IETF's view of what is best for
   IDNs, the DNS, and IDNA.  It was hoped that a situation would never
   arise in which the two perspectives would disagree, but the
   possibility was anticipated and considerable mechanism added to RFCs
   5890 and 5892 as a result.
   It is probably important to note that a
   disagreement in this context does not imply that anyone is "wrong",
   only that the two different groups have different needs and therefore
   different criteria about what is acceptable.  For that reason, the
   IETF has, in the past, allowed some characters for IDNA that active
   Unicode Technical Committee members suggested be disallowed to avoid
   a change in derived tables [RFC6452].  This document describes a case
   where the IETF should disallow a character or characters that the
   various properties would otherwise treat as PVALID.

   This document provides the "flagging for the IESG" specified by
   Section 5.1 of RFC 5892.  As specified there, the change itself
   requires IETF review because it alters the rules of Section 2 of that
   document.

   [[RFC Editor: please remove the following comment and note if they
   get to you.]]

   [[IESG: It might not be a bad idea to incorporate some version of
   the following into the Last Call announcement.]]

      NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
      particularly the choices among options for either adding exception
      cases to RFC 5892 or ignoring the issue, warning people, and
      hoping the results do not include serious problems, are fairly
      esoteric.  Understanding them requires that one have at least some
      understanding of how the Arabic Script (and perhaps other scripts
      in which precomposed characters are preferred over combining
      sequences as a Unicode design and extension principle) works and
      the reasons the Unicode Standard gives various Arabic Script
      characters a fairly extended discussion [Unicode70-Arabic].  It
      also requires understanding of a number of Unicode principles,
      including the Normalization Stability rules [UAX15-Versioning] as
      applied to new precomposed characters and guidelines for adding
      new characters.
      There is considerable discussion of the issues in
      Section 3 and references are provided for those who want to pursue
      them, but potential reviewers should assume that the background
      needed to understand the reasons for this change is no less deep
      in the subject matter than would be expected of someone reviewing
      a proposed change in, e.g., the fundamentals of BGP, TCP
      congestion control, or some cryptographic algorithm.  Put more
      bluntly, one's ability to read or speak languages other than
      English, or even one or more languages that use the Arabic script
      or other scripts similarly affected, does not make one an expert
      in these matters.

   This document assumes that the reader is reasonably familiar with the
   terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
   IETF conventions for representing Unicode code points [RFC5137].
   Some terms used here may not be used in the same way in those two
   sets of documents.  From one point of view, those differences may
   have been the results of, or led to, misunderstandings that may, in
   turn, be part of the root cause of the problems explored in this
   document.  In particular, this document uses the term "precomposed
   character" to describe characters that could reasonably be composed
   by a combining sequence using code points in the same script but for
   which a single code point that does not require combining sequences
   is available.  That definition is strictly about mechanical
   composition and does not involve any considerations about how the
   character is used.  It is closely related to this document's
   definition of "identical".  When a precomposed character exists and
   either applying NFC to the combining sequence does not yield that
   character or applying NFD to that character's code point does not
   yield the combining sequence, it is referred to in this document as
   "non-decomposable".

2.  Document Aspirations

   This document, in its present form, is not a proposal for a solution.
   Instead, it is intended to be (or evolve into) a comprehensive
   description of the issues and problems and to outline some possible
   approaches to a solution.  A perfect solution would resolve all of
   the issues identified in this document, would involve a relatively
   small set of relatively simple rules and hence would be
   comprehensible and predictable for and by non-expert end users, would
   not require code point by code point or even block by block exception
   lists, and would not leave users of any script or language feeling
   that their particular writing system has been treated less fairly
   than others.

   Part of the reality we need to accept is that IDNA, in its present
   form, represents compromises that do not completely satisfy those
   criteria and whatever is done about these issues will probably make
   it (or the job of administering zones containing IDNs) more complex.
   Similarly, as the Unicode Standard suggests when it identifies ten
   Design Principles and the text then says "Not all of these principles
   can be satisfied simultaneously..." [Unicode70-Design], while there
   are guidelines and principles, a certain amount of subjective
   judgment is involved in making determinations about normalization,
   decomposition, and some property values.  For Unicode itself, those
   issues are resolved by multiple statements (at least one cited below)
   that one needs to rely on per-code point information in the Unicode
   Character Database rather than on rules or principles.  The design of
   IDNA and the effort to keep it largely independent of Unicode
   versions requires rules, categories, and principles that can be
   relied upon and applied algorithmically.  There is obviously some
   tension between the two approaches.

3.  Problem Description

3.1.  IDNA assumptions about Unicode normalization

   IDNA makes several assumptions about Unicode, Unicode "characters",
   and the effects of normalization.  Those assumptions were based on
   careful reading of the Unicode Standard at the time [Unicode5],
   guided by advice and commitments by members of the Unicode Technical
   Committee.  Those assumptions, and the associated requirements, are
   necessitated by three properties of DNS labels that typically do not
   apply to blocks of running text:

   1.  There is no language context for a label.  While particular DNS
       zones may impose restrictions, including language or script
       restrictions, on what labels can be registered, neither the DNS
       nor IDNA imposes either type of restriction or gives the user of
       a label any indication about the registration or other
       restrictions that may have been imposed.

   2.  Labels are often mnemonics rather than words in any language.
       They may be abbreviations or acronyms or contain embedded digits
       and have other characteristics that are not typical of words.

   3.  Labels are, in practice, usually short.  Even when they are the
       maximum length allowed by the DNS and IDNA, they are typically
       too short to provide significant context.  Statements that
       suggest that languages can almost always be determined from
       relatively short paragraphs or equivalent bodies of text do not
       apply to DNS labels because of their typical short length and
       because, as noted above, they are not required to be formed
       according to language-based rules.

   At the same time, because the DNS is an exact-match system, there
   must be no ambiguity about whether two labels are equal.  Although
   there have been extensive discussions about "confusingly similar"
   characters, labels, and strings, such tests between scripts are
   always somewhat subjective: they are affected by choices of type
   styles and by what the user expects to see.
   In spite of the fact
   that the glyphs that represent many characters in different scripts
   are identical in appearance (e.g., basic Latin "a" (U+0061) and the
   identical-appearing Cyrillic character (U+0430)), the most important
   test is that, if two glyphs are the same within a given script, they
   must represent the same character no matter how they are formed.

   Unicode normalization, as explained in [UAX15], is expected to
   resolve those "same script, same glyph, different formation methods"
   issues.  Within the Latin script, the code point sequence for lower
   case "o" (U+006F) and combining diaeresis (U+0308) will, when
   normalized using the "NFC" method required by IDNA, produce the
   precomposed small letter o with diaeresis (U+00F6) and hence the two
   ways of forming the character will compare equal (and the combining
   sequence is effectively prohibited from U-labels).

   NFC was preferred over other normalization methods for IDNA because
   it is more compact, more likely to be produced on keyboards on which
   the relevant characters actually appeared, and because it does not
   lose substantive information (e.g., some types of compatibility
   equivalence involve judgment calls as to whether two characters are
   actually the same -- they may be "the same" in some contexts but not
   others -- while canonical equivalence is about different ways to
   produce the glyph for the same abstract character).

   IDNA also assumed that the extensive Unicode stability rules would be
   applied and work as specified when new code points were added.  Those
   rules, as described in The Unicode Standard and the normative annexes
   identified below, provide that:

   1.  New code points representing precomposed characters that can be
       formed from combining sequences will not be added to Unicode
       unless neither the relevant base character nor required combining
       character(s) are part of the Standard within the relevant script
       [UAX15-Versioning].

   2.  If circumstances require that principle be violated,
       normalization stability requires that the newly-added character
       decompose (even under NFC) to the previously-available combining
       sequence [UAX15-Exclusion].

   At least at the time IDNA2008 was being developed, there was no
   explicit provision in the Standard's discussion of conditions for
   adding new code points, nor of normalization stability, for an
   exception based on different languages using the same script or
   ambiguities about the shape or positioning of combining characters.

3.2.  The discovery and the Arabic script cases

   While the set of problems with normalization discussed above was
   discovered with a newly-added code point for the Arabic Script and
   some characteristics of Unicode handling of that script seem to make
   the problem more complex going forward, these are not issues specific
   to Arabic.  This section describes the Arabic-specific problems;
   subsequent ones (starting with Section 3.3) discuss the problem more
   generally and include illustrations from other scripts.

3.2.1.  New code point U+08A1, decomposition, and language dependency

   Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
   WITH HAMZA ABOVE.  As can be deduced from the name, it is visually
   identical to the glyph that can be formed from a combining sequence
   consisting of the code point for ARABIC LETTER BEH (U+0628) and the
   code point for Combining Hamza Above (U+0654).
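
   As an illustration only (it is not part of this specification), the
   contrast between the Latin example in Section 3.1 and the new code
   point can be checked with a short Python sketch.  It assumes a
   Python 3 interpreter whose standard "unicodedata" module provides
   NFC; the names nfc, latin_equal, and arabic_equal are hypothetical
   and were chosen for this example.

```python
import unicodedata


def nfc(text):
    """Return the NFC form of text (the form IDNA requires for U-labels)."""
    return unicodedata.normalize("NFC", text)


# Latin case from Section 3.1: "o" + COMBINING DIAERESIS vs. U+00F6.
latin_sequence = "\u006F\u0308"
latin_precomposed = "\u00F6"
latin_equal = nfc(latin_sequence) == nfc(latin_precomposed)

# Arabic case from this section: BEH + HAMZA ABOVE vs. U+08A1.
arabic_sequence = "\u0628\u0654"
arabic_precomposed = "\u08A1"
arabic_equal = nfc(arabic_sequence) == nfc(arabic_precomposed)

print("Latin forms equal after NFC: ", latin_equal)    # True
print("Arabic forms equal after NFC:", arabic_equal)   # False
```

   Because U+08A1 has no canonical decomposition, NFC leaves both
   Arabic forms unchanged and the second comparison fails, even though
   the two forms are intended to look identical.
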
   The two rules
   summarized above (see the last part of Section 3.1) suggest that
   either the new code point should not be allocated at all or that it
   should have a decomposition to \u'0628'\u'0654'.

   Had the issues outlined in this document been better understood at
   the time, it probably would have been wise for RFC 5892 to disallow
   either the precomposed character or the combining sequence of each
   pair in those cases in which Unicode normalization rules do not cause
   the right thing to happen, i.e., the combining sequence and
   precomposed character to be treated as equivalent.  Failure to do so
   at the time places an extra burden on registries to be sure that
   conflicts (and the potential for confusion and attacks) do not exist.
   Oddly, had the exclusion been made part of the specification at that
   time, the preference for precomposed forms noted above would probably
   have dictated excluding the combining sequence, something not
   otherwise done in IDNA2008 because the NFC requirement serves the
   same purpose.  Today, the only thing that can be excluded without the
   potential disruption of disallowing a previously-PVALID combining
   sequence is the newly-added code point, so whatever is done, or might
   have been contemplated with hindsight, will be somewhat inconsistent.

3.2.2.  Other examples of the same behavior within the Arabic Script

   One of the things that complicates the issue with the new U+08A1 code
   point is that there are several other Arabic-script code points that
   behave in the same way for similar language-specific reasons.

   In particular, at least three other grapheme clusters that have been
   present for many versions of Unicode can be seen as involving issues
   similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
   ABOVE.
   ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
   REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
   preferred over combining sequences using HAMZA ABOVE (U+0654)
   [Unicode70-Hamza].  By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
   (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH
   HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC
   LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654'
   so the precomposed character and combining sequences compare equal
   when both are normalized, as this specification prefers.

   There are other variations in which a precomposed character involving
   HAMZA ABOVE has a decomposition to a combining sequence that can form
   it.  For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
   compatibility decomposition, but not a canonical one, into the
   combining sequence \u'06C7'\u'0674'.

3.2.3.  Hamza and Combining Sequences

   As the Unicode Standard points out at some length [Unicode70-Arabic],
   Hamza is a problematic abstract character and the "Hamza Above"
   construction even more so [Unicode70-Hamza].  Those sections explain
   a distinction made by Unicode between the use of a Hamza mark to
   denote a glottal stop and one used as a diacritic mark to denote a
   separate letter.  In the first case, the combining sequence is used.
   In the second, a precomposed character is assigned.

   Unlike Unicode generally and because of concerns about identifier
   spoofing and attacks based on similarities, character distinctions in
   IDNA are based much more strictly on the appearance of characters;
   language and pronunciation distinctions within a script are not
   considered.
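
   As an illustration only, the decomposition behavior listed in
   Section 3.2.2 can be read from the Unicode Character Database with a
   short Python sketch.  It assumes a Python 3 interpreter whose
   standard "unicodedata" module reflects Unicode 7.0.0 or later; the
   names CODE_POINTS, classify, and results are hypothetical and were
   chosen for this example.

```python
import unicodedata

# Code points discussed in Sections 3.2.1 and 3.2.2.
CODE_POINTS = [0x0623, 0x0624, 0x0626, 0x0681, 0x076C, 0x0677, 0x08A1]


def classify(code_point):
    """Return 'canonical', 'compatibility', or 'none' for a code point."""
    mapping = unicodedata.decomposition(chr(code_point))
    if not mapping:
        return "none"            # no decomposition mapping in the UCD
    if mapping.startswith("<"):
        return "compatibility"   # tagged mapping, e.g. "<compat> 06C7 0674"
    return "canonical"           # untagged mapping, e.g. "0627 0654"


results = {cp: classify(cp) for cp in CODE_POINTS}

for cp, kind in results.items():
    name = unicodedata.name(chr(cp), "UNNAMED")
    mapping = unicodedata.decomposition(chr(cp)) or "-"
    print(f"U+{cp:04X} {name}: {kind} ({mapping})")
```

   Only the code points reported as "canonical" are unified with their
   combining sequences by NFC; those reported as "none" or
   "compatibility" are the ones that raise the comparison problem
   described in this document.
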
So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- 470 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them 471 is written as U+08A1 (new to Unicode 7.0.0) and the other as the 472 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also 473 available in versions of Unicode going back at least to the version 474 [Unicode32] used in the original version of IDNA [RFC3490]. Because 475 the precomposed form and combining sequence are, for IDNA purposes, 476 the same, IDNA expects that normalization (specifically the 477 requirement that all U-labels be in NFC form) will cause them to 478 compare equal. 480 If Unicode also considered them the same, then the principle would 481 apply that new precomposed ("composition") forms are not added unless 482 one of the code points that could be used to construct it did not 483 exist in an earlier version (and even then is discouraged) 484 [UAX15-Versioning]. When exceptions are made, they are expected to 485 conform to the rules and classes in the "Composition Exclusion 486 Table", with class 2 being relevant to this case [UAX15-Exclusion]. 487 That rule essentially requires that the normalization for the old 488 combining sequence to itself be retained (for stability) but that the 489 newly-added character be treated as canonically decomposable and 490 decompose back to the older sequence even under NFC. That was not 491 done for this particular case, presumably because of the distinction 492 about pronunciation modifiers versus separate letters noted above. 493 Because, for IDNA and the DNS, there is a possibility that the 494 composing sequence \u'0628'\u'0654' already appears in labels, the 495 only choice other than allowing an otherwise-identical, and 496 identically-appearing, label with U+08A1 substituted to identify a 497 different DNS entry is to DISALLOW the new character. 499 3.3. Precomposed characters without decompositions more generally 501 3.3.1. 
Description of the general problem 503 As mentioned above, IDNA made a strong assumption that, if there were 504 two ways to form the same abstract character in the same script, 505 normalization would result in them comparing equal. Work on IDNA2008 506 recognized that early versions of Unicode might also contain some 507 inconsistencies; see Section 3.3.2.4 below. 509 Having precomposed code points exist that don't have decompositions, 510 or having them allocated in the future, is problematic for those IDNA 511 assumptions about character comparison, and seems to call for 512 excluding some set of code points that IDNA's rules do not now 513 identify, developing and using a normalization procedure that behaves 514 as expected (those two options may be nearly equivalent for many 515 purposes), or deciding to accept a risk that, apparently, will only 516 increase over time. 518 It is not clear whether the reasons the IDNABIS WG did not understand 519 and allow for these cases are important except insofar as they inform 520 considerations about what to do in the future. It seemed (and still 521 seems to some people) that the Unicode Standard is very clear on the 522 matter. In addition to the normalization stability rules cited in 523 the last part of Section 3.1, the discussion in the Core Standard 524 seems quite clear. For example, "Where characters are used in 525 different ways in different languages, the relevant properties are 526 normally defined outside the Unicode Standard" in Section 2.2, 527 subsection titled "Semantics" [Unicode7] did not suggest to most 528 readers that sometimes separate code points would be allocated within 529 a script based on language considerations.
Similarly, the same 530 section of the Standard says, in a subsection titled "Unification", 531 "The Unicode Standard avoids duplicate encoding of characters by 532 unifying them within scripts across language" and does not list 533 exceptions to that rule or limit it to a single script although it 534 goes on to list "CJK" as an example. Another subsection, "Equivalent 535 Sequences", indicates "Common precomposed forms ... are included for 536 compatibility with current standards. For static precomposed forms, 537 the standard provides a mapping to an equivalent dynamically composed 538 sequence of characters". The latter appears to be precisely the "all 539 precomposed characters decompose into the relevant combining 540 sequences if the relevant base and combining characters exist in the 541 Standard" rule that IDNA needs and assumed and, again, there is no mention 542 of exceptions, language-dependent or otherwise. The summary of 543 stability policies cited in the Standard [Unicode70-Stability] does 544 not appear to shed any additional light on these issues. 546 The Standard now contains a subsection titled "Non-decomposition of 547 Overlaid Diacritics" [Unicod70-Overlay] that identifies a list of 548 diacritics that do not normally form characters that have 549 decompositions. The rule given has its own exceptions and the text 550 clearly states that there is actually no way to know whether a code 551 point has a decomposition other than consulting the Unicode Character 552 Database entry for that code point. The subsequent section notes 553 that this can be a security problem; while the issues with IDNA go 554 well beyond what is normally considered security, that comment now 555 seems clear. While that subsection is helpful in explaining the 556 problem, especially for European scripts, it does not appear in the 557 Unicode versions that were current when IDNA2008 was being developed. 559 3.3.2.
Latin Examples and Cases 561 While this set of problems was discovered because of a code point 562 added to the Arabic script in precombined form to support a 563 particular language, there are actually far more examples for, e.g., 564 Latin script than there are for Arabic script. Many of them are 565 associated with the "non-decomposition of combining diacriticals" 566 issues mentioned above, but the next subsections describe other cases 567 that are not directly bound to decomposition. 569 3.3.2.1. The font exclusion and compatibility relationships 571 Unicode contains a large collection of characters that are identified 572 as "Mathematical Symbols". A large subset of them are basic or 573 decorated Latin characters, differing from the ordinary ones only by 574 their usage and, in appearance, by font or type styling (despite the 575 general principle that font distinctions are not used as the basis 576 for assigning separate code points). Most of these have canonical 577 mappings to the base form, which eliminates them from IDNA, but 578 others do not and, because the same marks that are used as phonetic 579 diacritical markings in conventional alphabetical use have special 580 mathematical meanings, applications that permit the use of these 581 characters have their own issues with normalization and equality. 583 3.3.2.2. The phonetic notation characters and extensions 585 Another example involves various Phonetic Alphabet and Extension 586 characters, many of which, unlike the Mathematical ones, do not have 587 normalizations that would make them compare equal to the basic 588 characters with essentially identical representations. This would 589 not be a problem for IDNA if they were identified with a specialized 590 script or as symbols rather than letters, but neither is the case: 591 they are generally identified as lower case Latin Script letters even 592 when they are visually upper-case, another issue for IDNA. 594 3.3.2.3.
Combining dots and other shapes combine... unless... 596 The discussion of "Non-decomposition of Overlaid Diacritics" 597 [Unicod70-Overlay] indirectly exhibits at least one reason why it has 598 been difficult to characterize the problem. If one combines that 599 subsection with others, one gets a set of rules that might be 600 described as: 602 1. If the precomposed character and the code points that make up the 603 combining sequence exist, then canonical composition and 604 decomposition work as expected, except... 606 2. If the precomposed character was added to Unicode after the code 607 points that make up the combining sequence, normalization 608 stability for the combining sequences requires that NFC applied 609 to the precomposed character decomposes rather than having the 610 combining sequence compose to the new character, however... 612 3. If the combining sequence involves a diacritic or other mark that 613 actually touches the base character when composed, the 614 precomposed character does not have a decomposition, unless... 616 4. The combining diacritic involved is Cedilla (U+0327), Ogonek 617 (U+0328), or Horn (U+031B), in which case the precomposed 618 characters that contain them "regularly" (but presumably not 619 always) decompose, and... 621 5. There are further exceptions for Hamza (which does not overlay 622 the associated base character in the same way the Latin-derived 623 combining diacritics and other marks do). Those decisions to 624 decompose a precomposed character (or not) are based on language 625 or phonetic considerations, not the combining mechanism or 626 appearance, or perhaps,... 628 6. Some characters have compatibility decompositions rather than 629 canonical ones [Unicod70-CompatDecomp].
Because compatibility 630 relationships are treated differently by IDNA, PRECIS 631 [PRECIS-Framework], and, potentially, other protocols involving 632 identifiers for Internet use, the existence of a compatibility 633 relationship may or may not be helpful. Finally,... 635 7. There is no reason to believe the above list is complete. In 636 particular, if whether a precomposed character decomposes or not 637 is determined by language or phonetic distinctions, one may need 638 additional rules on a per-script and/or per-character basis. 640 The above list only covers the cases involving combining sequences. 641 It does not cover cases such as those in Section 3.3.2.1 and 642 Section 3.3.2.2 and there may be additional groups of cases not yet 643 identified. 645 3.3.2.4. "Legacy" characters and new additions 647 The development of categories and rules for IDNA recognized that 648 early versions of Unicode might contain some inconsistencies if 649 evaluated using more contemporary rules about code point assignments 650 and stability. In particular, there might be some exceptions from 651 different practices in early versions of Unicode or anomalies caused 652 by copying existing single- or dual-script standards into Unicode as 653 block rather than individual character additions to the repertoire. 654 The possibility of such "legacy" exceptions was one reason why the 655 IDNA category rules include explicit provisions for exception lists 656 (even though no such code points were identified prior to 2014). 658 3.3.3. Examples and Cases from Other Scripts 660 Research into these issues has not yet turned up a comprehensive list 661 of affected scripts and code points.
As discussed elsewhere in this 662 document, it is clear that Arabic and Latin Scripts are significantly 663 affected, that some Han and Kangxi radicals and ideographs are 664 affected, and that other examples do exist -- it is just not known 665 how many of those examples there are and what patterns, if any, 666 characterize them. 668 3.3.4. Scripts with precomposed preferences and ones with combining 669 preferences 671 While the authors have been unable to find an explanation for the 672 differentiation in the Unicode Standard, we have been told that there 673 are differences among scripts as to whether the action preference is 674 to add new combining sequences only (and resist adding precomposed 675 characters) as suggested in Section 3.3.2.3 or to add precomposed 676 characters, often ones that do not have decompositions. If those 677 differences in preference do exist, it is probably important to have 678 them documented so that they can be reflected in IDNA review 679 procedures and elsewhere. It will also require IETF discussion of 680 whether combining sequences should be deprecated when the 681 corresponding precomposed characters are added or 682 disallowed entirely for those scripts (as has been 683 implicitly suggested for Arabic language use [RFC5564]). 685 [[CREF1: The above isn't quite right and probably needs additional 686 discussion and text.]] 688 3.4. Confusion and the casual user 690 To the extent to which predictability for relatively casual users is 691 a desired and important feature of relevant application or 692 application support protocols, it is probably worth observing that 693 the complex of rules and cases above is almost certainly too involved 694 for the typical such user to develop a good intuitive understanding 695 of how things behave and what relationships exist. 697 4. Implementation options and issues: Unicode properties, exceptions, 698 and the nature of stability 700 4.1.
Unicode Stability compared to IETF (and ICANN) Stability 702 The various stability rules in Unicode [Unicode70-Stability] all 703 appear to be based on the model that once a value is assigned, it can 704 never be changed. That is probably appropriate for a character 705 coding system with multiple uses and applications. It is probably 706 the only option when normative relationships are expressed in tables 707 of values rather than by rules. One consequence of such a model is 708 that it is difficult or impossible to fix mistakes (for some 709 stability rules, the Unicode Standard does provide for exceptions) 710 and even harder to make adjustments that would normally be dictated 711 by evolution. 713 "No changes" provides a very strong and predictable type of stability 714 and there are many reasons to take that path. As in some of the 715 cases that motivated this document, the difficulty is that simply 716 adding new code points (in Unicode) or features (in a protocol or 717 application) may be destabilizing. One then has complete stability 718 for systems that never use or allow the new code points or features, 719 but rough edges for newer systems that see the discrepancies and 720 rough edges. IDNA2003 (inadvertently) took that approach by freezing 721 on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever 722 been allowed, we would have had complete stability even as Unicode 723 libraries changed. Unicode has been quite ingenious about working 724 around those difficulties with such provisions as having code points 725 for newly-added precomposed characters decompose rather than altering 726 the normalization for the combining sequences. Other cases, such as 727 newly-added precomposed characters that do not decompose for, e.g., 728 language or phonetic reasons, are more problematic. 
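The two behaviours contrasted above can be observed with Python's standard `unicodedata` module. This is only a sketch: FORKING (U+2ADC) is an assumed example, not taken from this document, of a precomposed character that is excluded from composition, and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

def nfc(text):
    """Return `text` in Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# Workaround case: U+2ADC is excluded from composition, so NFC decomposes it
# to U+2ADD U+0338 and the older combining sequence keeps its normalization.
forking = "\u2adc"
forking_sequence = "\u2add\u0338"
stable_case_equal = nfc(forking) == nfc(forking_sequence)

# Problematic case: U+08A1 has no decomposition, so NFC leaves both the
# precomposed character and the sequence U+0628 U+0654 unchanged.
beh_hamza = "\u08a1"
beh_hamza_sequence = "\u0628\u0654"
problem_case_equal = nfc(beh_hamza) == nfc(beh_hamza_sequence)

print("U+2ADC vs. sequence equal under NFC:", stable_case_equal)
print("U+08A1 vs. sequence equal under NFC:", problem_case_equal)
```

In the first case a comparison after NFC treats the two spellings as the same string; in the second it does not, which is the situation this document is concerned with.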
730 The IETF (and ICANN and standards development bodies such as ISO and 731 ISO/IEC JTC1) have generally adopted a different type of stability 732 model, one which considers experience in use and the ill effects of 733 not making changes as well as the disruptive effects of doing so. In 734 the IETF model, if an earlier decision is causing sufficient harm and 735 there is consensus in the communities that are most affected that a 736 change is desirable enough to make transition costs acceptable, then 737 the change is made. 739 The difference and its implications are perhaps best illustrated by a 740 disagreement when IDNA2008 was being approved. IDNA2003 had 741 effectively prevented some characters, notably (measured by intensity 742 of the protests) the Sharp S character (U+00DF), from being used in 743 DNS labels by mapping them to other characters before conversion to 744 ACE form. It had also prohibited some other code points, notably ZWJ 745 (U+200D) and ZWNJ (U+200C), by discarding them. In both cases, there 746 were strong voices from the relevant language communities, supported 747 by the registry communities, that the characters were important 748 enough that it was more desirable to undergo the short-term pain of a 749 transition and some uncertainty than to continue to exclude those 750 characters, and the IDNA2008 rules and repertoire are consistent with 751 that preference. The Unicode Consortium apparently believed that 752 stability --elimination of any possibility of label invalidation or 753 different interpretations of the same string-- was more important 754 than those writing system requirements and community preferences. 755 That view was expressed through what was effectively a fork in (or 756 attempt to nullify) the IETF Standard [UTS46], a result that has 757 probably been worse for the overall Internet than either of the 758 possible decision choices. 760 4.2.
New Unicode Properties 762 One suggestion about the way out of these problems would be to create 763 one or more new Unicode properties, maintained along with the rest of 764 Unicode, and then incorporated into new or modified rules or 765 categories in IDNA. Given the analysis in this document, it appears 766 that that property (or properties) would need to provide: 768 1. Identification of combining characters that, when used in 769 combining sequences, do not produce decomposable characters. 770 [[CREF2: Wording on the above is not quite right but, for the 771 present, maybe the intent is clear.]] 773 2. Identification of precomposed characters that might reasonably be 774 expected to decompose, but that do not. 776 3. Identification of character forms that are distinct only because 777 of language or phonetic distinctions within a script. 779 4. Identification of scripts for which precomposed forms are 780 strongly preferred and combining sequences should either be 781 viewed as temporary mechanisms until precomposed characters are 782 assigned or banned entirely. 784 5. Identification of code points that represent symbols for 785 specific, non-language, purposes even if identified as letters or 786 numerals by their General Property (see Section 3.3.2.2 and 787 Section 3.3.2.1). 789 Some of these properties (or characteristics or values of a single 790 property) would be suitable for disallowing characters, code points, 791 or contextual sequences that otherwise might be allowed by IDNA. 792 Others would be more suitable for making equality comparisons come 793 out as needed by IDNA, particularly to eliminate distinctions based 794 on language context. 796 While it would appear that appropriate rules and categories could be 797 developed for IDNA (and, presumably, for PRECIS, etc.) 
if the problem 798 areas are those identified in this document, it is not yet known 799 whether the list is complete (and, hence, whether additional 800 properties or information would be needed). 802 Even with such properties, IDNA would still almost certainly need 803 exception lists. In addition, it is likely that stability rules for 804 those properties would need to reflect IETF norms with arrangements 805 for bringing the IETF and other communities into the discussion when 806 tradeoffs are reviewed. 808 4.3. The need for exception lists 810 [[CREF3: Note in draft: this section is a partial placeholder and may 811 need more elaboration.]] 812 Issues with exception lists and the requirements for them are 813 discussed in Section 2 above and RFC 5894 [RFC5894]. 815 5. Proposed/ Alternative Changes to RFC 5892 for the issues first 816 exposed by new code point U+08A1 818 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and 819 the first paragraph of each Subsection below for the status of the 820 Subsections that follow. Each one, in combination with the material 821 in Section 3 above, also provides information about the reasons why 822 that particular strategy might or might not be appropriate. 824 5.1. Disallow This New Code Point 826 This option is almost certainly too Arabic-specific and does not 827 solve, or even address, the underlying problem. It also does not 828 inherently generalize to non-decomposing precomposed code points that 829 might be added in the future (whether to Arabic or other scripts) 830 even though one could add more code points to Category F in the same 831 way. 833 If chosen by the community, this subsection would update the portion 834 of the IDNA2008 specification that identifies rules for what 835 characters are permitted [RFC5892] to disallow that code point.
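In implementation terms, a Category F style exception amounts to a table lookup that takes precedence over the property-derived rules. The Python sketch below illustrates only that idea: the names `EXCEPTIONS` and `idna_property` are hypothetical, the table holds just the code point under discussion rather than the full Category F list, and `derived_property` stands in for the remaining RFC 5892 rules, which are not implemented here.

```python
# Hypothetical, deliberately partial exception table: only U+08A1 is listed.
EXCEPTIONS = {
    0x08A1: "DISALLOWED",  # ARABIC LETTER BEH WITH HAMZA ABOVE
}

def idna_property(cp, derived_property):
    """Return the exception value for `cp` if one exists, else defer.

    `derived_property` is an assumed callable that represents the rest of
    the rule set; it receives the code point and returns a property value.
    """
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    return derived_property(cp)

# Placeholder for the other rules, used for demonstration only.
def everything_pvalid(cp):
    return "PVALID"

print(idna_property(0x08A1, everything_pvalid))
print(idna_property(0x0628, everything_pvalid))
```

Because the lookup runs first, the listed code point is DISALLOWED regardless of what the remaining rules would derive for it.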
837 With the publication of this document, Section 2.6 ("Exceptions (F)") 838 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in 839 Category F so that the rule itself reads: 841 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, 842 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, 843 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, 844 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, 845 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, 846 303B, 30FB} 848 and then add to the subtable designated 849 "DISALLOWED -- Would otherwise have been PVALID" 850 after the line that begins "07FA", the additional line: 852 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 854 This has the effect of making the cited code point DISALLOWED 855 independent of application of the rest of the IDNA rule set to the 856 current version of Unicode. Those wishing to create domain name 857 labels containing Beh with Hamza Above may continue to use the 858 sequence 860 U+0628, ARABIC LETTER BEH 861 followed by 863 U+0654, ARABIC HAMZA ABOVE 865 which was valid for IDNA purposes in Unicode 5.0 and earlier and 866 which continues to be valid. 868 In principle, much the same thing could be accomplished by using the 869 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892 870 Section 5.3). However, that category is described as applying only 871 when "property values in versions of Unicode after 5.2 have changed 872 in such a way that the derived property value would no longer be 873 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in 874 Unicode 7.0.0 and no property values of code points in prior versions 875 have changed, category G does not apply. If that section of RFC 5892 876 were to be replaced in the future, perhaps consideration should be 877 given to adding Normalization Stability and other issues to that 878 description but, at present, it is not relevant. 880 5.2. 
Disallow This New Code Point and All Future Precomposed Additions 881 that do not decompose 883 At least in principle, the approach suggested above (Section 5.1) 884 could be expanded to disallow all future allocations of non- 885 decomposing precomposed characters. This would probably require 886 either a new Unicode property to identify such characters and/or more 887 emphasis on the manual, individual code point, checking of the new 888 Unicode version review process (i.e., not just application of the 889 existing rules and algorithm). It might require either a new rule in 890 IDNA or a modification to the structure of Category F to make 891 additions less tedious. It would do nothing for different ways to 892 form identical characters within the same script that were not 893 associated with decomposition and so would have to be used in 894 conjunction with other approaches. Finally, for scripts (such as 895 Arabic) where there is a very strong preference to avoid combining 896 sequences, this approach would exclude exactly the wrong set of 897 characters. 899 5.3. Disallow the combining sequences for these characters 901 As in the approach discussed in Section 5.1, this approach is too 902 Arabic-specific to address the more general problem. However, it 903 illustrates a single-script approach and a possible mechanism for 904 excluding combining sequences whose handling is connected to language 905 information (information that, as discussed above, is not relevant to 906 the DNS). 908 If chosen by the community, this subsection would update the portion 909 of the IDNA2008 specification that identifies contextual rules 910 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction 911 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631).
Note that 912 the choice of this option is consistent with the general preference 913 for precomposed characters discussed above but would ban some labels 914 that are valid today and that might, in principle, be in use. 916 The required prohibition could be imposed by creating a new 917 contextual rule in RFC 5892 to constrain combining sequences 918 containing Hamza Above. 920 As the Unicode Standard points out at some length [Unicode70-Arabic], 921 Hamza is a problematic abstract character and the "Hamza Above" 922 construction even more so. IDNA has historically associated 923 characters whose use is reasonable in some contexts but not others 924 with the special derived property "CONTEXTO" and then specified 925 specific, context-dependent, rules about where they may be used. 926 Because Hamza Above is problematic (and spawns edge cases, as 927 discussed in the Unicode Standard section cited above), it was 928 suggested that a contextual rule might be appropriate. There are at 929 least two reasons why a contextual rule would not be suitable for the 930 present situation. 932 1. As discussed above, the present situation is a normalization 933 stability and predictability problem, not a contextual one. Had 934 the same issues arisen with a newly-added precomposed character 935 that could previously be constructed from non-problematic base 936 and combining characters, it would be even more clearly a 937 normalization issue and, following the principles discussed there 938 and particularly in UAX 15 [UAX15-Exclusion], might not have been 939 assigned at all. 941 2. The contextual rule sets are designed around restricting the use 942 of code points to a particular script or adjacent to particular 943 characters within that script. Neither of these cases applies to 944 the newly-added character even if one could imagine rules for the 945 use of Hamza Above (U+0654) that would reflect the considerations 946 of Chapter 8 of Unicode 6.2. 
Even had the latter been desired, 947 it would be somewhat late now -- Hamza Above has been present as 948 a combining character (U+0654) in many versions of Unicode. 949 While that section of the Unicode Standard describes the issues, 950 it does not provide actionable guidance about what to do about them 951 for cases going forward or when visual identity is important. 953 5.4. Disallow all Combining Characters for Specific Scripts 955 [[CREF4: This subsection needs to be turned into prose, but the 956 following bullet points are probably sufficient to identify the 957 issues.]] 959 Might work for Arabic and other "precomposed preference" scripts (see 960 Section 3.3.4); recommended by the Arabic language community for IDNs 961 [RFC5564]. Hopeless for Latin. Backwards incompatible. No effect 962 at all on special-use representations of identical characters within 963 a script (see Section 3.3.2.1 and Section 3.3.2.2). 965 5.5. Do Nothing Other Than Warn 967 The recommendation from UTC is to simply warn registries, at all 968 levels of the tree, to be careful with this set of characters, making 969 language distinctions within zones. Because the DNS cannot make or 970 enforce language distinctions, this suggestion is problematic but it 971 would avoid having the IETF either invalidate label strings that 972 are potentially now in use or create inconsistencies among the 973 characters that combine with Hamza Above but that also have 974 precomposed forms that do not have decompositions. The potential 975 would still exist for registries to respect the warning and deprecate 976 such labels if they existed. 978 5.6. Normalization Form IETF (NFI) 980 The most radical possibility for the comparison issue would be to 981 decide that none of the Unicode Normalization Forms specified in UAX 982 15 [UAX15] are adequate for use with the DNS because, contrary to 983 their apparent descriptions, normalization tables are actually 984 determined using language information.
However, use of language 985 information is unacceptable for IDNA for reasons described elsewhere 986 in this document. The remedy would be to define an IETF-specific (or 987 DNS-specific) normalization form (sometimes called "NFI" in 988 discussions), building on NFC but adhering strictly to the rule that 989 normalization causes two different forms of the same character (glyph 990 image) within the same script to be treated as equal. In practice 991 such a form could be implemented for IDNA purposes as an additional 992 rule within RFC 5892 (and its successors) that constituted an 993 exception list for the NFC tables. For this set of characters, the 994 special IETF normalization form would be equivalent to the exclusion 995 discussed in Section 5.3 above. 997 An Internet-specific normalization form, especially if specified 998 somewhat separately from the IDNA core, would have a small marginal 999 advantage over the other strategies in this section (or in 1000 combination with some of them), even though most of the end result 1001 and much of the implementation would be the same in practice. While 1002 the design of IDNA requires that strings be normalized as part of the 1003 process of determining label validity (and hence before either 1004 storage of values in the DNS or name resolution), there is an ongoing 1005 debate about whether normalization should be performed before storing 1006 a string or putting it on the wire or only when the string is 1007 actually compared or otherwise used. 1009 If a normalization procedure with the right properties for the IETF 1010 was defined, that argument could be bypassed and the best decisions 1011 made for different circumstances. The separation would also allow 1012 better comparison of strings that lack language context in 1013 applications environments in which the additional processing and 1014 character classifications of IDNA and/or PRECIS were not applicable. 
1015 Having such a normalization procedure defined outside IDNA would also 1016 minimize changes to IDNA itself, which is probably an advantage. 1018 If the new normalization form were, in practice, simply an overlay on 1019 NFC with modifications dictated by exception and/or property lists, 1020 keeping its definition separate from IDNA would also avoid 1021 interweaving those exceptions and property lists with the rules and 1022 categories of IDNA itself, avoiding some unnecessary complexity. 1024 6. Editorial clarification to RFC 5892 1026 Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a 1027 clarification to Appendix A and Section A.1 of RFC 5892. This 1028 section of this document updates the RFC to apply that clarification. 1030 1. In Appendix A, add a new paragraph after the paragraph that 1031 begins "The code point...". The new paragraph should read: 1033 "For the rule to be evaluated to True for the label, it MUST be 1034 evaluated separately for every occurrence of the Code point in 1035 the label; each of those evaluations must result in True." 1037 2. In Appendix A, Section A.1, replace the "Rule Set" by 1039 Rule Set: 1040 False; 1041 If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; 1042 If cp .eq. \u200C And 1043 RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp 1044 (Joining_Type:T)*(Joining_Type:{R,D})) Then True; 1046 7. Acknowledgements 1048 The Unicode 7.0.0 changes were extensively discussed within the IAB's 1049 Internationalization Program. The authors are grateful for the 1050 discussions and feedback there, especially from Andrew Sullivan and 1051 David Thaler.
Additional information was requested and received from 1052 Mark Davis and Ken Whistler and, while they probably do not agree with 1053 the necessity of excluding this code point or taking even more 1054 drastic action, as their responsibility is to look at the Unicode 1055 Consortium requirements for stability, the decision would not have 1056 been possible without their input. Thanks to Bill McQuillan and Ted 1057 Hardie for reading versions of the document carefully enough to 1058 identify and report some confusing typographical errors. Several 1059 experts and reviewers who prefer to remain anonymous also provided 1060 helpful input and comments on preliminary versions of this document. 1062 8. IANA Considerations 1064 When the IANA registry and tables are updated to reflect Unicode 1065 7.0.0, changes should be made according to the decisions the IETF 1066 makes about Section 5. 1068 9. Security Considerations 1070 From at least one point of view, this document is entirely a 1071 discussion of a security issue or set of such issues. The 1072 "similar-looking characters" issue has been a concern since the 1073 earliest days of IDNs [HomographAttack] and has driven assorted 1074 "character confusion" projects [ICANN-VIP]. However, if a user types in a 1075 string on one device and can get different results that do not 1076 compare equal when it is typed on a different device (with both 1077 behaving correctly and both keyboards appearing to be the same and 1078 for the same script) then all security mechanisms that depend on the 1079 underlying identifiers, including the practical applications of DNS 1080 response integrity checks (DNSSEC [RFC4033]) and DNS-embedded public 1081 key mechanisms [RFC6698], are at risk if different parties, at least 1082 one of them malicious, obtain some of the identical-appearing and 1083 identically-typed strings.
1085 Mechanisms that depend on trusting registration systems (e.g., 1086 registries and registrars in the DNS IDN case, see Section 5.5 above) 1087 are likely to be of only limited utility because fully-qualified 1088 domains that may be perfectly reasonable at the first level or two of 1089 the DNS may have differences of this type deep in the tree, into 1090 levels where name management is weak. Similar issues obviously apply 1091 when names are user-selected or unmanaged. 1093 When the issue is not a deliberate attack but simple accidental 1094 confusion among similar strings, most of our strategies depend on the 1095 acceptability of false negatives on matching if there is low risk of 1096 false positives (see, for example, the discussion of false negatives 1097 in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]). 1098 Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and 1099 the PRECIS effort [PRECIS-Framework]. But, because the cases covered 1100 here are connected, not just to what the user sees but to what is 1101 typed and where, there is an increased risk of false positives 1102 (accidental as well as deliberate). 1104 [[CREF5: Note in Draft: The paragraph that follows was written for a 1105 much earlier version of this document. It is obsolete, but is being 1106 retained as a placeholder for future developments.]] 1107 This specification excludes a code point for which the Unicode- 1108 specified normalization behavior could result in two ways to form a 1109 visually-identical character within the same script not comparing 1110 equal. That behavior could create a dream case for someone intending 1111 to confuse the user by use of a domain name that looked identical to 1112 another one, was entirely in the same script, but was still 1113 considered different. 
1115 Internet security in areas that involve internationalized identifiers 1116 that might contain the relevant characters is therefore significantly 1117 dependent on some effective resolution for the issues identified in 1118 this document, not just hand waving, devout wishes, or appointment of 1119 study committees about it. 1121 10. References 1123 10.1. Normative References 1125 [PRECIS-Framework] 1126 Saint-Andre, P. and M. Blanchet, "PRECIS Framework: 1127 Preparation, Enforcement, and Comparison of 1128 Internationalized Strings in Application Protocols", 1129 February 2015, . 1132 [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP 1133 137, RFC 5137, February 2008. 1135 [RFC5890] Klensin, J., "Internationalized Domain Names for 1136 Applications (IDNA): Definitions and Document Framework", 1137 RFC 5890, August 2010. 1139 [RFC5892] Faltstrom, P., "The Unicode Code Points and 1140 Internationalized Domain Names for Applications (IDNA)", 1141 RFC 5892, August 2010. 1143 [RFC5892Erratum] 1144 "RFC5892, "The Unicode Code Points and Internationalized 1145 Domain Names for Applications (IDNA)", August 2010, Errata 1146 ID: 3312", Errata ID 3312, August 2012, 1147 . 1149 [RFC5894] Klensin, J., "Internationalized Domain Names for 1150 Applications (IDNA): Background, Explanation, and 1151 Rationale", RFC 5894, August 2010. 1153 [RFC6943] Thaler, D., "Issues in Identifier Comparison for Security 1154 Purposes", RFC 6943, May 2013. 1156 [UAX15] Davis, M., Ed., "Unicode Standard Annex #15: Unicode 1157 Normalization Forms", June 2014, 1158 . 1160 [UAX15-Exclusion] 1161 "Unicode Standard Annex #15: op. cit., Section 5", 1162 . 1165 [UAX15-Versioning] 1166 "Unicode Standard Annex #15, op. cit., Section 3", 1167 . 1169 [UTS46] Davis, M. and M. Suignard, "Unicode Technical Standard 1170 #46: Unicode IDNA Compatibility Processing", Version 1171 7.0.0, June 2014, .
1173 [Unicod70-CompatDecomp] 1174 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1175 2.3: Compatibility Characters", Chapter 2, 2014, 1176 . 1178 Subsection titled "Compatibility Decomposable Characters" 1179 starting on page 26. 1181 [Unicod70-Overlay] 1182 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1183 2.2: Unicode Design Principles", Chapter 2, 2014, 1184 . 1186 Subsection titled "Non-decomposition of Overlaid 1187 Diacritics" starting on page 64. 1189 [Unicode5] 1190 The Unicode Consortium, "The Unicode Standard, Version 1191 5.0", ISBN 0-321-48091-0, 2007. 1193 Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. 1194 This printed reference has now been updated online to 1195 reflect additional code points. For code points, the 1196 reference at the time RFCs 5890-5894 were published is to 1197 Unicode 5.2. 1199 [Unicode62] 1200 The Unicode Consortium, "The Unicode Standard, Version 1201 6.2.0", ISBN 978-1-936213-07-8, 2012, 1202 . 1204 Preferred citation: The Unicode Consortium. The Unicode 1205 Standard, Version 6.2.0, (Mountain View, CA: The Unicode 1206 Consortium, 2012. ISBN 978-1-936213-07-8) 1208 [Unicode7] 1209 The Unicode Consortium, "The Unicode Standard, Version 1210 7.0.0", ISBN 978-1-936213-09-2, 2014, 1211 . 1213 Preferred Citation: The Unicode Consortium. The Unicode 1214 Standard, Version 7.0.0, (Mountain View, CA: The Unicode 1215 Consortium, 2014. ISBN 978-1-936213-09-2) 1217 [Unicode70-Arabic] 1218 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1219 9.2: Arabic", Chapter 9, 2014, 1220 . 1222 Subsection titled "Encoding Principles", paragraph 1223 numbered 4, starting on page 362. 1225 [Unicode70-Design] 1226 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1227 2.2: Unicode Design Principles", Chapter 2, 2014, 1228 . 1230 [Unicode70-Hamza] 1231 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1232 9.2: Arabic", Chapter 9, 2014, 1233 .
1235 Subsection titled "Combining Hamza Above" starting on page 1236 378. 1238 [Unicode70-Stability] 1239 "The Unicode Standard, Version 7.0.0, op. cit., Chapter 1240 2.2: Unicode Design Principles", Chapter 2, 2014, 1241 . 1243 Subsection titled "Stability" starting on page 23 and 1244 containing a link to http://www.unicode.org/policies/ 1245 stability_policy.html. 1247 10.2. Informative References 1249 [Dalby] Dalby, A., "Dictionary of Languages: The definitive 1250 reference to more than 400 languages", Columbia University 1251 Press, 2004. 1253 pages 206-207 1255 [Daniels] Daniels, P. and W. Bright, "The World's Writing Systems", 1256 Oxford University Press, 1986. 1258 [HomographAttack] 1259 Gabrilovich, E. and A. Gontmakher, "The Homograph Attack", 1260 Communications of the ACM 45(2):128, February 2002, 1261 . 1264 [ICANN-VIP] 1265 ICANN, "The IDN Variant Issues Project: A Study of Issues 1266 Related to the Management of IDN Variant TLDs (Integrated 1267 Issues Report)", February 2012, 1268 . 1271 [Omniglot-Fula] 1272 Ager, S., "Omniglot: Fula (Fulfulde, Pulaar, 1273 Pular'Fulaare)", 1274 . 1276 Captured 2015-01-07 1278 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1279 "Internationalizing Domain Names in Applications (IDNA)", 1280 RFC 3490, March 2003. 1282 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1283 Resource Identifier (URI): Generic Syntax", STD 66, RFC 1284 3986, January 2005. 1286 [RFC4033] Arends, R., Austein, R., Larson, M., Massey, D., and S. 1287 Rose, "DNS Security Introduction and Requirements", RFC 1288 4033, March 2005. 1290 [RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman, 1291 "Linguistic Guidelines for the Use of the Arabic Language 1292 in Internet Domains", RFC 5564, February 2010. 1294 [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and 1295 Internationalized Domain Names for Applications (IDNA) - 1296 Unicode 6.0", RFC 6452, November 2011.
1298 [RFC6698] Hoffman, P. and J. Schlyter, "The DNS-Based Authentication 1299 of Named Entities (DANE) Transport Layer Security (TLS) 1300 Protocol: TLSA", RFC 6698, August 2012. 1302 [Unicode32] 1303 The Unicode Consortium, "The Unicode Standard, Version 1304 3.2.0", . 1306 The Unicode Standard, Version 3.2.0 is defined by The 1307 Unicode Standard, Version 3.0 (Reading, MA, Addison- 1308 Wesley, 2000. ISBN 0-201-61633-5), as amended by the 1309 Unicode Standard Annex #27: Unicode 3.1 1310 (http://www.unicode.org/reports/tr27/) and by the Unicode 1311 Standard Annex #28: Unicode 3.2 1312 (http://www.unicode.org/reports/tr28/). 1314 Appendix A. Change Log 1316 RFC Editor: Please remove this appendix before publication. 1318 A.1. Changes from version -00 to -01 1320 o Version 01 of this document is an extensive rewrite and 1321 reorganization, reflecting discussions with UTC members and adding 1322 three more options for discussion to the original proposal to 1323 simply disallow the new code point. 1325 A.2. Changes from version -01 to -02 1327 Corrected a typographical error in which Hamza Above was 1328 listed with the wrong code point. 1330 A.3. Changes from version -02 to -03 1332 Corrected a typographical error in the Abstract in which RFC 5892 was 1333 incorrectly shown as 5982. 1335 A.4. Changes from version -03 to -04 1337 o Explicitly identified the applicability of U+08A1 to Fula and 1338 added references that discuss that language and how it is written. 1340 o Updated several Unicode 6.2 references to point to Unicode 7.0 1341 since the latter is now available in stable form (it was done when 1342 work on this I-D started). 1344 o Extensively revised to discuss the non-Arabic cases, non- 1345 decomposing diacritics, other types of characters that don't 1346 compare equal after normalization, and the more general problem and 1347 approaches to it.
1349 Authors' Addresses 1351 John C Klensin 1352 1770 Massachusetts Ave, Ste 322 1353 Cambridge, MA 02140 1354 USA 1356 Phone: +1 617 245 1457 1357 Email: john-ietf@jck.com 1359 Patrik Faltstrom 1360 Netnod 1361 Franzengatan 5 1362 Stockholm 112 51 1363 Sweden 1365 Phone: +46 70 6059051 1366 Email: paf@netnod.se