idnits 2.17.1

draft-klensin-idna-5892upd-unicode70-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 184: '...ated to True for the label, it MUST be...'

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5982, updated by this document, for
     RFC5378 checks: 2008-05-13)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 21, 2014) is 3566 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Duplicate reference: RFC5892, mentioned in 'RFC5892', was also mentioned
     in 'RFC5892Erratum'.

  ** Downref: Normative reference to an Informational RFC: RFC 6943

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Versioning'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Arabic'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Hamza'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'

     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                       J.C. Klensin
Internet-Draft                                              P. Faltstrom
Updates: 5892 (if approved)                                       Netnod
Intended status: Standards Track                           July 21, 2014
Expires: January 20, 2015

                     IDNA Update for Unicode 7.0.0
              draft-klensin-idna-5892upd-unicode70-00.txt

Abstract

   The current version of the IDNA specifications anticipated that each
   new version of Unicode would be reviewed to determine whether any
   changes had been introduced that required adjustments to the set of
   rules and, in particular, whether new exceptions or backward
   compatibility adjustments were needed.  That review was conducted
   for Unicode 7.0.0 and identified a problematic new code point.  This
   specification updates RFC 5892 to disallow that code point and
   provides information about the reasons why that exclusion is
   appropriate.  It also applies an editorial clarification that was
   the subject of an earlier erratum.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 20, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Change to RFC 5892 for new character U+08A1
   3.  Editorial clarification to RFC 5892
   4.  Explanation
     4.1.  A related historical problem
     4.2.  How this is being done
       4.2.1.  Backward compatibility and normalization
       4.2.2.  A new contextual rule
   5.  Acknowledgements
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   The current version of the IDNA specifications, known as "IDNA2008"
   [RFC5890], anticipated that each new version of Unicode would be
   reviewed to determine whether any changes had been introduced that
   required adjustments to IDNA's rules and, in particular, whether new
   exceptions or backward compatibility adjustments were needed.  When
   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
   comparing it to prior versions including the text in Unicode 6.2
   [Unicode62], it identified a problematic new code point (U+08A1,
   ARABIC LETTER BEH WITH HAMZA ABOVE).  Section 2 of this specification
   updates the portion of the IDNA2008 specification that identifies
   rules for what characters are permitted [RFC5892] to disallow that
   code point.  It also provides information about the reasons why that
   exclusion is appropriate.

   As anticipated when IDNA2008, and RFC 5892 in particular, were
   written, exceptions and explicit updates are likely to be needed only
   if there is disagreement between the Unicode Consortium's view about
   what is best for the Standard and the IETF's view of what is best for
   IDNs, the DNS, and IDNA.  It was hoped that a situation would never
   arise in which the two perspectives would disagree, but the
   possibility was anticipated and considerable mechanism added to RFCs
   5890 and 5892 as a result.  It is probably important to note that a
   disagreement in this context does not imply that anyone is "wrong",
   only that the two groups have different needs and therefore different
   criteria about what is acceptable.
   For that reason, the IETF has, in the past, allowed some characters
   for IDNA that active Unicode Technical Committee members suggested be
   disallowed to avoid a change in derived tables [RFC6452].  This
   document describes a case where the IETF should disallow a character
   that the various properties would otherwise treat as PVALID.

   This document provides the "flagging for the IESG" specified by
   Section 5.1 of RFC 5892.  As specified there, the change itself
   requires IETF review because it alters the rules of Section 2 of that
   document.

   Readers of this document are expected to be familiar with Unicode
   terminology [Unicode62] and the IETF conventions for representing
   Unicode code points [RFC5137].

   As a convenience to readers of RFC 5892 and to reduce the risks of
   confusion, this document also formally applies the content of an
   erratum to the text of the RFC (see Section 3) and so brings that RFC
   up to date with all agreed changes.

   [[RFC Editor: please remove the following comment and note if they
   get to you.]]

   [[IESG: It might not be a bad idea to incorporate some version of
   the following into the Last Call announcement.]]

      NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
      particularly the extended discussion below of why this change to
      RFC 5892 is necessary and appropriate, are fairly esoteric.
      Understanding them requires that one have at least some
      understanding of how the Arabic Script works and the reasons the
      Unicode Standard gives various Arabic Script characters a fairly
      extended discussion.  It also requires understanding of a number
      of Unicode principles, including the Normalization Stability rules
      as applied to new precomposed characters and guidelines for adding
      new characters.
      References are provided for those who want to pursue them, but
      potential reviewers should assume that the background needed to
      understand the reasons for this change is no less deep in the
      subject matter than would be expected of someone reviewing a
      proposed change in, e.g., the fundamentals of BGP, TCP congestion
      control, or some cryptographic algorithm.

2.  Change to RFC 5892 for new character U+08A1

   With the publication of this document, Section 2.6 ("Exceptions (F)")
   of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
   Category F so that the rule itself reads:

   F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
                0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
                0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
                06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
                3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
                303B, 30FB}

   and then add to the subtable designated
   "DISALLOWED -- Would otherwise have been PVALID"
   after the line that begins "07FA", the additional line:

   08A1; DISALLOWED  # ARABIC LETTER BEH WITH HAMZA ABOVE

   This has the effect of making the cited code point DISALLOWED
   independent of application of the rest of the IDNA rule set to the
   current version of Unicode.  Those wishing to create domain name
   labels containing Beh with Hamza Above may continue to use the
   sequence

      U+0628, ARABIC LETTER BEH

   followed by

      U+0654, ARABIC HAMZA ABOVE

   which was valid for IDNA purposes in Unicode 5.0 and earlier and
   which continues to be valid.

3.  Editorial clarification to RFC 5892

   Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
   clarification to Appendix A and Section A.1 of RFC 5892.  This
   section of this document updates the RFC to apply that clarification.

   1.  In Appendix A, add a new paragraph after the paragraph that
       begins "The code point...".
       The new paragraph should read:

       "For the rule to be evaluated to True for the label, it MUST be
       evaluated separately for every occurrence of the Code point in the
       label; each of those evaluations must result in True."

   2.  In Appendix A, Section A.1, replace the "Rule Set" by

       Rule Set:
         False;
         If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
         If cp .eq. \u200C And
            RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
              (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

4.  Explanation

   [[NOTE IN DRAFT: Given the nature of this document, we believe this
   material belongs here.  It could, however, be moved to an appendix if
   anyone felt strongly about that.]]

   This section summarizes some of the discussions and reasoning that
   led to the conclusion and change in Section 2.  It should not be
   considered as either normative or authoritative.

   As the Unicode Standard points out at some length [Unicode62-Arabic],
   Hamza is a problematic abstract character and the "Hamza Above"
   construction even more so [Unicode62-Hamza].  Those sections explain
   a distinction made by Unicode between the use of a Hamza mark to
   denote a glottal stop and one used as a diacritic mark to denote a
   separate letter.  In the first case, the combining sequence is used.
   In the second, a precombined character is assigned.

   Unlike Unicode generally and because of concerns about identifier
   spoofing and attacks based on similarities, character distinctions in
   IDNA are based much more strictly on the appearance of characters;
   pronunciation distinctions are not considered.
   So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-tautologically the
   same as BEH WITH HAMZA ABOVE, even if one of them is written as
   U+08A1 (new to Unicode 7.0.0) and the other as the sequence
   \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also available in
   versions of Unicode going back at least to the original publication
   of RFC 5892).  Because the two are, for IDNA purposes, the same, IDNA
   expects that normalization (specifically the requirement that all
   U-labels be in NFC form) will cause them to compare equal.

   If Unicode also considered them the same, then the principle would
   apply that new precomposed ("composition") forms are not added unless
   one of the code points that could be used to construct them did not
   exist in an earlier version (and even then the addition is
   discouraged) [UAX15-Versioning].  When exceptions are made, they are
   expected to conform to the rules and classes in the "Composition
   Exclusion Table", with class 2 being relevant to this case
   [UAX15-Exclusion].  That rule essentially requires that the old
   combining sequence continue to normalize to itself (for stability)
   but that the newly-added character be treated as canonically
   decomposable and decompose back to the older sequence even under NFC.
   That was not done for this particular case, presumably because of the
   distinction about pronunciation modifiers versus separate letters
   noted above.  Because, for IDNA and the DNS, there is a possibility
   that the composing sequence \u'0628'\u'0654' already appears in
   labels, the only choice other than allowing an otherwise-identical,
   and identically-appearing, label with U+08A1 substituted to identify
   a different DNS entry is to DISALLOW the new character.

4.1.  A related historical problem

   At least three other grapheme clusters have been present for many
   versions of Unicode and can be seen as involving issues similar to
   those for the newly-added ARABIC LETTER BEH WITH HAMZA ABOVE.  ARABIC
   LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER REH WITH HAMZA
   ABOVE (U+076C) do not have decomposition forms and are preferred over
   combining sequences using HAMZA ABOVE (U+0654) [Unicode62-Hamza].  By
   contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE (U+0623) decomposes
   into \u'0627'\u'0654' and ARABIC LETTER YEH WITH HAMZA ABOVE (U+0626)
   decomposes into \u'064A'\u'0654', so the precomposed character and
   combining sequences compare equal when both are normalized, as this
   specification prefers.

   There are other variations on this theme.  For example, ARABIC LETTER
   U WITH HAMZA ABOVE (U+0677) has a compatibility decomposition into
   the combining sequence \u'06C7'\u'0674'.

   Had the issues outlined in this document been better understood at
   the time, it probably would have been wise for RFC 5892 to disallow
   either the precomposed character or the combining sequence of each
   pair unless Unicode normalization rules cause the right thing to
   happen.  Failure to do so at the time places an extra burden on
   registries to be sure that conflicts (and the potential for confusion
   and attacks) do not exist.  Oddly, had the exclusion been made part
   of the specification at that time, the preference noted above would
   probably have dictated excluding the combining sequence, something
   not otherwise done in IDNA2008.  Today, the only thing that can be
   excluded without the potential disruption of disallowing a
   previously-PVALID combining sequence is the newly-added code point,
   so whatever is done, or might have been contemplated with hindsight,
   would be somewhat inconsistent.

4.2.  How this is being done

   Questions have arisen as to why this specification makes the change
   to RFC 5892 by DISALLOWing U+08A1 as a simple exception (IDNA
   Category F, RFC 5892 Section 2.6) rather than either treating it as a
   backward-compatibility case (IDNA Category G, RFC 5892 Section 2.7)
   or modifying IDNA Category F to make Hamza (or Hamza Above, or
   combining Hamza generally) into CONTEXTO cases and specifying
   appropriate limitations in a new entry in the IANA IDNA Context
   Registry (as specified in RFC 5892 Section 5.2).  The subsections
   below explain why neither of those alternatives was chosen despite
   some discussion of each.

4.2.1.  Backward compatibility and normalization

   The "BackwardCompatible" category (IDNA Category G, RFC 5892 Section
   2.7) is described as applying only when "property values in versions
   of Unicode after 5.2 have changed in such a way that the derived
   property value would no longer be PVALID or DISALLOWED".  Because
   U+08A1 is a newly-added code point in Unicode 7.0.0 and no property
   values of code points in prior versions have changed, Category G
   does not apply.  If that section of RFC 5892 is replaced in the
   future, perhaps consideration should be given to adding Normalization
   Stability and other issues to that description but, at present, it is
   not relevant.

4.2.2.  A new contextual rule

   As the Unicode Standard points out at some length [Unicode62-Arabic],
   Hamza is a problematic abstract character and the "Hamza Above"
   construction even more so.  IDNA has historically associated
   characters whose use is reasonable in some contexts but not others
   with the special derived property "CONTEXTO" and then specified
   context-dependent rules about where they may be used.
   Because Hamza Above is problematic (and spawns edge cases, as
   discussed in the Unicode Standard section cited above), it was
   suggested that a contextual rule might be appropriate.  There are at
   least two reasons why a contextual rule would not be suitable for the
   present situation.

   1.  As discussed above, the present situation is a normalization
       stability and predictability problem, not a contextual one.  Had
       the same issues arisen with a newly-added precomposed character
       that could previously be constructed from non-problematic base
       and combining characters, it would be even more clearly a
       normalization issue and, following the principles discussed there
       and particularly in UAX 15 [UAX15-Exclusion], might not have been
       assigned at all.

   2.  The contextual rule sets are designed around restricting the use
       of code points to a particular script or adjacent to particular
       characters within that script.  Neither of these cases applies to
       the newly-added character even if one could imagine rules for the
       use of Hamza Above (U+0654) that would reflect the considerations
       of Chapter 8 of Unicode 6.2.  Even had the latter been desired,
       it would be somewhat late now -- Hamza Above has been present as
       a combining character (U+0654) in many versions of Unicode.
       While that section of the Unicode Standard describes the issues,
       it does not provide actionable guidance about what to do about
       them for cases going forward or when visual identity is
       important.

5.  Acknowledgements

   The Unicode 7.0.0 changes were extensively discussed within the IAB's
   Internationalization Program.  The authors are grateful for the
   discussions and feedback there, especially from Andrew Sullivan and
   David Thaler.
   Additional information was requested and received from Mark Davis and
   Ken Whistler.  While they probably do not agree with the necessity of
   excluding this code point, as their responsibility is to look at the
   Unicode Consortium requirements for stability, the decision would not
   have been possible without their input.  Several experts and
   reviewers who prefer to remain anonymous also provided helpful input
   and comments on preliminary versions of this document.

6.  IANA Considerations

   When the IANA registry and tables are updated to reflect Unicode
   7.0.0, code point U+08A1 should be identified as DISALLOWED,
   consistent with the change made in Section 2.

7.  Security Considerations

   This specification excludes a code point for which the
   Unicode-specified normalization behavior could result in two ways to
   form a visually-identical character within the same script not
   comparing equal.  That behavior could create a dream case for someone
   intending to confuse the user by use of a domain name that looked
   identical to another one, was entirely in the same script, but was
   still considered different (see, for example, the discussion of false
   negatives in identifier comparison in Section 2.1 of RFC 6943
   [RFC6943]).  This exclusion therefore should improve Internet
   security.

8.  References

8.1.  Normative References

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
              137, RFC 5137, February 2008.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5892Erratum]
              "RFC5892, "The Unicode Code Points and Internationalized
              Domain Names for Applications (IDNA)", August 2010, Errata
              ID: 3312", Errata ID 3312, August 2012.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
              Purposes", RFC 6943, May 2013.

   [UAX15-Exclusion]
              Davis, M., Ed., "Unicode Standard Annex #15: Unicode
              Normalization Forms, Section 5", June 2014.

   [UAX15-Versioning]
              Davis, M., Ed., "Unicode Standard Annex #15: Unicode
              Normalization Forms, Section 3", June 2014.

   [Unicode62-Arabic]
              "The Unicode Standard, Version 6.2.0, op. cit., Chapter
              8", Chapter 8, 2012.

              Subsection titled "Encoding Principles", paragraph
              numbered 4, starting on page 251.

   [Unicode62-Hamza]
              "The Unicode Standard, Version 6.2.0, op. cit., Chapter
              8", Chapter 8, 2012.

              Subsection titled "Combining Hamza Above" starting on page
              263.

   [Unicode62]
              The Unicode Consortium, "The Unicode Standard, Version
              6.2.0", ISBN 978-1-936213-07-8, 2012.

              Preferred citation: The Unicode Consortium.  The Unicode
              Standard, Version 6.2.0, (Mountain View, CA: The Unicode
              Consortium, 2012.  ISBN 978-1-936213-07-8)

   [Unicode7]
              The Unicode Consortium, "The Unicode Standard, Version
              7.0.0", ISBN 978-1-936213-09-2, 2014.

              Preferred citation: The Unicode Consortium.  The Unicode
              Standard, Version 7.0.0, (Mountain View, CA: The Unicode
              Consortium, 2014.  ISBN 978-1-936213-09-2)

8.2.  Informative References

   [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA) -
              Unicode 6.0", RFC 6452, November 2011.

Authors' Addresses

   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   USA

   Phone: +1 617 245 1457
   Email: john-ietf@jck.com


   Patrik Faltstrom
   Netnod
   Franzengatan 5
   Stockholm,   112 51
   Sweden

   Phone: +46 70 6059051
   Email: paf@netnod.se
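
   Illustrative note (not part of the specification): the normalization
   behavior described in Section 4 can be observed with a short sketch
   such as the one below.  It assumes a Python 3 interpreter whose
   "unicodedata" module carries Unicode 7.0.0 or later data; the names
   nfc_equal, NEWLY_DISALLOWED and label_uses_newly_disallowed are
   hypothetical helpers introduced only for this example and do not
   come from RFC 5892 or from any IDNA library.

```python
import unicodedata

# Code points taken from Sections 2 and 4.1 of this document.
BEH_HAMZA_PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"   # BEH followed by combining HAMZA ABOVE
YEH_HAMZA_PRECOMPOSED = "\u0626"      # ARABIC LETTER YEH WITH HAMZA ABOVE
YEH_HAMZA_SEQUENCE = "\u064A\u0654"   # YEH followed by combining HAMZA ABOVE


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Assumption for illustration: only the single code point that Section 2
# adds to Category F, not the full RFC 5892 derived-property calculation.
NEWLY_DISALLOWED = {0x08A1}


def label_uses_newly_disallowed(label: str) -> bool:
    """Return True if the label contains the code point disallowed in Section 2."""
    return any(ord(ch) in NEWLY_DISALLOWED for ch in label)


if __name__ == "__main__":
    # The result depends on the Unicode data shipped with the interpreter.
    print("Unicode data version:", unicodedata.unidata_version)
    # U+08A1 has no canonical decomposition, so NFC does not unify the pair.
    print("BEH pair equal under NFC:",
          nfc_equal(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE))
    # U+0626 decomposes canonically, so NFC does unify this pair.
    print("YEH pair equal under NFC:",
          nfc_equal(YEH_HAMZA_PRECOMPOSED, YEH_HAMZA_SEQUENCE))
```

   A registry-side check built along these lines would reject a label
   for which label_uses_newly_disallowed returns True while continuing
   to accept the combining sequence; a real implementation would apply
   the complete IDNA2008 rule set rather than this single test.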