Network Working Group                                        A. Sullivan
Internet-Draft                                                 Dyn, Inc.
Intended status: Informational                                 D. Thaler
Expires: August 3, 2013                                        Microsoft
                                                              J. Klensin

                                                              O. Kolkman
                                                              NLnet Labs
                                                        January 30, 2013


    Principles for Unicode Code Point Inclusion in Labels in the DNS
                 draft-iab-dns-zone-codepoint-pples-02

Abstract

   IDNA makes available to DNS zone administrators a very wide range of
   Unicode code points.  Most operators of zones should probably not
   permit registration of U-labels using the entire range.  This is
   especially true of zones that accept registrations across
   organizational boundaries, such as top-level domains and, most
   importantly, the root.
   It is unfortunately not possible to generate algorithms to determine
   whether permitting a code point presents a low risk.  This memo
   presents a set of principles that can be used to guide the decision
   of whether a Unicode code point may be wisely included in the
   repertoire of permissible code points in a U-label in a zone.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 3, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Background
     2.1.  More-Restrictive Rules Going Up the DNS Tree
   3.  Principles Applicable to All Zones
     3.1.  Longevity Principle
     3.2.  Least Astonishment Principle
     3.3.  Contextual Safety Principle
   4.  Principles Applicable to All Public Zones
     4.1.  Conservatism Principle
     4.2.  Inclusion Principle
     4.3.  Simplicity Principle
     4.4.  Predictability Principle
     4.5.  Stability Principle
   5.  Principle Specific to the Root Zone
     5.1.  Letter Principle
   6.  Confusion and Context
   7.  Conclusion
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. IAB Members at the Time of This Writing
   12. Informative References
   Authors' Addresses

1.  Introduction

   Operators of a DNS zone need to set policies around what Unicode code
   points are allowed in labels in that zone.  Typically there are a
   number of important goals to consider when constructing such
   policies.
   These include, for instance, avoiding possible visual
   confusability between two labels, avoiding possible confusion between
   Fully-Qualified Domain Names (FQDNs) and IP address literals,
   accessibility to the disabled (see [WCAG20] for some discussion in a
   web context), and other usability issues.

   This document provides a set of principles that zone operators can
   use to construct their code point policies in order to improve
   usability and clarity and thereby reduce confusion.

1.1.  Terminology

   This document uses the following terms.

   A-label:  an LDH label that starts with "xn--" and meets all the
      IDNA requirements, with additional restrictions as explained in
      Section 2.3.2.1 of [RFC5890].

   Character:  a member of a set of elements used for the
      organization, control, or representation of data.  See Section 2
      of [RFC6365] for more details.

   Language:  a way that humans communicate.  The use of language
      occurs in many forms, the most common of which are speech,
      writing, and signing.  See Section 2 of [RFC6365] for more
      details.

   LDH Label:  a string consisting of ASCII letters, digits, and the
      hyphen, with additional restrictions as explained in Section 2.3.1
      of [RFC5890].

   Public zone:  in this document, a DNS zone that accepts
      registration requests from organizations outside the zone
      administrator's own organization.  (Whether the zone performs
      delegation is a separate question.  What is important is the
      diversity of the registration-requesting community.)  Note that
      under this definition, the root zone is a public zone, though one
      that has a unique function in the DNS.

   Rendering:  the display of a string of text.  See Section 5 of
      [RFC6365] for more details.

   Script:  a set of graphic characters used for the written form of
      one or more languages.  See Section 2 of [RFC6365] for more
      details.

   U-label:  a string of Unicode characters that meets all the IDNA
      requirements and includes at least one non-ASCII character, with
      additional restrictions as explained in Section 2.3.2.1 of
      [RFC5890].

   Writing system:  a set of rules for using one or more scripts to
      write a particular language.  See Section 2 of [RFC6365] for more
      details.

   This memo does not propose a protocol standard, and the use of words
   such as "should" follows the ordinary English meaning, and not that
   laid out in [RFC2119].

2.  Background

   In recent communications ([IABCOMM1] and [IABCOMM2]), the IAB has
   emphasized the importance of conservatism in allocating labels
   conforming to IDNA2008 ([RFC5890], [RFC5891], [RFC5892], [RFC5893],
   [RFC5894], [RFC5895]) in DNS zones, and especially in the root zone.
   Traditional LDH-labels in the root zone used only alphabetic
   characters (i.e., ASCII a-z or A-Z).  Matters are more complicated
   with U-labels, however.  The IAB communications recommended that
   U-labels permit only code points with a General_Category (gc) of Ll
   (Lowercase_Letter), Lo (Other_Letter), or Lm (Modifier_Letter), but
   noted that for practical considerations other code points might be
   permitted on a case-by-case basis.

   The IAB recommendations do, however, leave some issues open that need
   to be addressed.  First, it is by no means clear that all of the code
   points with General_Category Lo or Lm and which are permitted under
   IDNA2008 are appropriate for a zone such as the root zone.  To take
   but one example, the code point U+02BC MODIFIER LETTER APOSTROPHE has
   a General_Category of Lm.  In practically every rendering (and we are
   unaware of an exception), U+02BC is indistinguishable from U+2019
   RIGHT SINGLE QUOTATION MARK, which has a General_Category of Pf
   (Final_Punctuation).
   U+02BC will also be read by large numbers of
   people as being the same character as U+0027 APOSTROPHE, which has a
   General_Category of Po (Other_Punctuation), and some computer systems
   may treat U+02BC as U+0027.  U+02BC is PROTOCOL VALID (PVALID) under
   IDNA2008 (see [RFC5892]), whereas both other code points are
   DISALLOWED.  So, to begin with, it is plain that not every code point
   with a General_Category of Ll, Lo, or Lm is consistent with the type
   of conservatism principle discussed in Section 4.1 or the IAB
   recommendation.

   To make matters worse, some languages are dependent on code points
   with General_Category Mc (Spacing_Mark) or General_Category Mn
   (Nonspacing_Mark).  This dependency is particularly common in Indic
   languages, though not exclusive to them.  (At the risk of vastly
   oversimplifying, the overarching issue is mostly the interaction of
   complex writing systems and the way Unicode works.)  To restrict
   users of those languages only to code points with General_Category of
   Ll, Lo, or Lm would be extremely limiting.  While DNS labels are not
   words, or sentences, or phrases (as noted in [RFC4690]), they are
   intended to support useful mnemonics.  Mnemonics that diverge wildly
   from the usual conventions are poor ones, because in not following
   the usual conventions they are not easy to remember.  Also, wide
   divergence from usual conventions, if not well-justified (and
   especially in a shared namespace like the root) invites political
   controversy.

   Many of the issues above turn out to be relevant to all public zones.
   Moreover, the overall issue of developing a policy for code point
   permission is common to all zones that accept A-labels or U-labels
   for registration.  As section 4.2.4 of [RFC5891] says, every registry
   at every level of the DNS is "expected to establish policies about
   label registrations."
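
   The General_Category values cited earlier in this section can be
   inspected with a short script.  What follows is only a sketch, not
   part of the IAB recommendation: it assumes Python's standard
   "unicodedata" module, whose answers depend on the Unicode version
   bundled with the interpreter.  The helper name passes_letter_filter
   and the sample Devanagari label are hypothetical and chosen purely
   for illustration.

```python
import unicodedata

# General_Category values named in the IAB recommendation.
LETTER_CATEGORIES = {"Ll", "Lo", "Lm"}


def passes_letter_filter(ch):
    """Return True if ch has a General_Category of Ll, Lo, or Lm."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


# The three code points compared in the text.
for cp in (0x02BC, 0x2019, 0x0027):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"gc={unicodedata.category(ch)}, "
          f"letter filter passes: {passes_letter_filter(ch)}")

# Illustrative label (an assumption, not taken from the memo): the
# Hindi word "hindi" written in Devanagari, which relies on code
# points in the Mc and Mn categories.
HINDI = "\u0939\u093F\u0928\u094D\u0926\u0940"
for ch in HINDI:
    print(f"U+{ord(ch):04X} gc={unicodedata.category(ch)}")
print("whole label passes letter filter:",
      all(passes_letter_filter(ch) for ch in HINDI))
```

   A filter based only on Ll, Lo, and Lm accepts U+02BC even though it
   is easily mistaken for the two punctuation code points, and it
   rejects an ordinary word that depends on combining marks.  Both
   results illustrate why General_Category alone cannot settle which
   code points to permit.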

   For reasons of sound management, it is not desirable to decide
   whether to permit a given code point only when an application
   containing that code point is pending.  That approach reduces
   predictability and is bound to appear subject to special pleas.  It
   is better instead to come up with the rules governing acceptance of
   code points in advance.

   As is evident from the foregoing discussion about the Letter and Mark
   categories, it is simply not possible to make code point decisions
   algorithmically.  If it were possible to develop such an algorithm,
   it would already exist: the DNS is hardly unique in needing to impose
   restrictions on code points while accommodating many different
   linguistic communities.  Nevertheless, new guidelines can be made by
   starting from overarching principles.  These guidelines act more as
   meta-rules, leading to the establishment of other rules about the
   inclusion and exclusion of particular code points in labels in a
   given zone, always based on the list of code points permitted by
   IDNA.

2.1.  More-Restrictive Rules Going Up the DNS Tree

   A set of principles derived from the above ideas follows in Section 3
   through Section 5 below.  Such principles fall into three categories.

   Some principles apply to every DNS zone.  Some additional principles
   apply to all public zones, including the root zone.  Finally, other
   principles apply only to the root zone.  This means that zones higher
   in the DNS tree tend to have more restrictive rules (since additional
   principles apply), and zones lower in the DNS tree tend to have less
   restrictive rules, since they are used within a more narrow context.
   In general, the relevant context for a principle is that of the zone,
   not that of a given subset of the user community; for the root zone,
   for example, the context is "the entire Internet population".

3.  Principles Applicable to All Zones

3.1.  Longevity Principle

   Unicode properties of a code point ought to be stable across the
   versions of Unicode that users of the zone are likely to have
   installed.  Because it is possible for the properties of a code point
   to change between Unicode versions, a good way to predict such
   stability is to ensure that a code point has in fact been stable for
   multiple successive versions of Unicode.  This principle is related
   to the Stability Principle in Section 4.5.

   The more diverse the community using the zone, the greater the
   importance of following this principle.  The policy for a leaf zone
   in the DNS might only require stability across two Unicode versions,
   whereas a more public zone might require stability across four or
   more releases before the code point's properties are considered
   long-lived and stable.

3.2.  Least Astonishment Principle

   Every zone administrator should be sensitive to the likely use of a
   code point to be permitted, particularly taking into account the
   population likely to use the zone.  Zone administrators should
   especially consider whether a candidate code point could present
   difficulty if the code point is encountered outside the usual
   linguistic circumstances.  By the same token, the failure to support
   a code point that is normal in some linguistic circumstances could be
   very surprising for users likely to encounter the names in that
   circumstance.

3.3.  Contextual Safety Principle

   Every zone administrator should be sensitive to ways in which a code
   point that is permitted could be used in support of malicious
   activity.  This is not a completely new problem: the digit 1 and the
   lower-case letter l are, for instance, easily confused in many
   contexts.  The very large repertoire of code points in Unicode (even
   just the subset permitted for IDNs) makes the problem somewhat worse,
   just because of the scale.

4.  Principles Applicable to All Public Zones

4.1.  Conservatism Principle

   Public zones are, by definition, zones that are shared by different
   groups of people.  Therefore, any decision to permit a code point in
   a public zone (including the root) should be as conservative as
   practicable.  Doubts should always be resolved in favor of rejecting
   a code point for inclusion rather than in favor of including it, in
   order to minimize risk.

4.2.  Inclusion Principle

   Just as IDNA2008 starts from the principle that the Unicode range is
   excluded, and then adds code points according to derived properties
   of the code points, so a public zone should only permit inclusion of
   a code point if it is known to be "safe" in terms of usability and
   confusability within the context of that zone.  The default treatment
   of a code point should be that it is excluded.

4.3.  Simplicity Principle

   The rules for determining whether a code point is to be included
   should be simple enough that they are readily understood by someone
   with a moderate background in the DNS and Unicode issues.  This
   principle does not mean that a completely naive person needs to be
   able to understand the rationale for why a code point is included,
   but it does mean that the reason for inclusion of very peculiar code
   points, even if the code points are safe in themselves, will be too
   difficult to understand and such code points will therefore be
   rejected.

   The meaning of "simple" or "readily understood" is context-dependent.
   For instance, the root zone has to serve everyone in the world; for
   practical purposes, this means that the reasons for including a code
   point need to be comprehensible even to people who cannot use the
   script where the code point is found.
   In a zone that permits a
   constrained subset of Unicode characters (for instance, only those
   needed to write a single alphabetic language) and that supports a
   clearly-delineated linguistic community (for instance, the speakers
   of a single language with well-understood written conventions), more
   complicated rules might be acceptable.  Compare this principle with
   the Least Astonishment Principle in Section 3.2.

4.4.  Predictability Principle

   The rules for determining whether a code point is to be included
   should be predictable enough that those with the requisite
   understanding of DNS, IDNA, and Unicode will usually reach the same
   conclusion.  This is not a requirement for algorithmic treatment of
   code points; as previously noted, that is not possible.  It is rather
   to say that the consistent application of professional judgment is
   likely to yield the same results; combined with the principle in
   Section 4.1, when results are not predictable the anomalous code
   point would not be permitted.

   Just as in Section 4.3, this principle tends to cause more
   restriction the more diverse the community using the zone; it is most
   restrictive for the root zone.  This is because what is predictable
   within a given language community is possibly very surprising across
   languages.

4.5.  Stability Principle

   Once a code point is permitted, it is at least very hard to stop
   permitting that code point.  In public zones (including the root),
   the list of code points to be permitted should change very slowly, if
   at all, and usually only in the direction of permitting an addition
   as time and experience indicate that inclusion of such a code point
   is both safe and consistent with these principles.

5.  Principle Specific to the Root Zone

5.1.  Letter Principle

   There is a note in [RFC1123] that top-level labels "will be
   alphabetic".
   In the absence of widespread agreement about the force
   of that note, prudence suggests that U-labels in the root zone should
   exclude code points that are not normally used to write words, or
   that are in some cases normally used for purposes other than writing
   words.  This is not the same as using Unicode's General_Category to
   include only letters.  It is a restriction that expands the possible
   class of included code points beyond the Unicode letters, but only
   expands so far as to include the things that are normally used to
   create words.  Under this principle, code points with (for example)
   General_Category Mn (Nonspacing_Mark) might be included -- but only
   those that are used to write words and not (for instance) musical
   symbols.  In addition, such marks should only be used within a label
   in ways that they would be used when making a word: combinations that
   would be nonsense when used in a word should also be rejected when
   tried in DNS labels.  This principle should be applied as narrowly as
   possible; as [RFC4690] says, "While DNS labels may conveniently be
   used to express words in many circumstances, the goal is not to
   express words (or sentences or phrases), but to permit the creation
   of unambiguous labels with good mnemonic value."

6.  Confusion and Context

   While many discussions of confusion have focused on characters, e.g.,
   whether two characters are confusable with each other (and under what
   circumstances), a focus on characters alone could lead to the
   prohibition of very large numbers of labels, including many that
   present little risk.  Instead, the focus should be on whether one
   label is confusable with another.
   For example, if a label contains
   several characters that are distinct to a particular script, and all
   of its characters are from that script, it is inherently not
   confusable with a label from any other script no matter what other
   characters might appear in it.  Another label that lacks those
   distinguishing characters might be a problem.  The notion extends
   from labels to domain names, in the sense that distinguishing
   characters used in a higher-level label may set expectations with
   respect to the characters in the lower level labels.  This
   expectation might be regarded as a benefit, but it is also a problem,
   since there is no technical way to require consistent policies in
   delegated name spaces.

7.  Conclusion

   The principles outlined in this document can be applied when
   considering any range of Unicode code points for possible inclusion
   in a DNS zone.  It is worth observing that doing anything (especially
   in light of Section 4.5) implicitly disadvantages communities with a
   writing system not yet well understood and not represented in the
   technical and policy communities involved in the discussion.  That
   disadvantage is to be guarded against as much as practical, but is
   effectively impossible to prevent (while still taking action) in
   light of imperfect human knowledge.

8.  Security Considerations

   The principles outlined in this memo are intended to improve
   usability and clarity and thereby reduce confusion among different
   labels.  While these principles may contribute to reduction of risk,
   they are not sufficient to provide a comprehensive
   internationalization policy for zone management.

   Additional discussion of Unicode security considerations can be found
   in [UTR36].

9.  IANA Considerations

   None.  RFC Editor: this section may be removed on publication.

10.  Acknowledgements

   The authors thank the participants in the IAB Internationalization
   program for the discussion of the ideas in this memo, particularly
   Marc Blanchet.  In addition, Stephane Bortzmeyer, Paul Hoffman,
   Daniel Kalchev, Panagiotis Papaspiliopoulos, and Vaggelis Segredakis
   made specific comments.

11.  IAB Members at the Time of This Writing

   Bernard Aboba
   Jari Arkko
   Marc Blanchet
   Ross Callon
   Alissa Cooper
   Spencer Dawkins
   Joel Halpern
   Russ Housley
   David Kessens
   Danny McPherson
   Jon Peterson
   Dave Thaler
   Hannes Tschofenig

12.  Informative References

   [IABCOMM1]
              Internet Architecture Board, "IAB Statement: 'The
              interpretation of rules in the ICANN gTLD Applicant
              Guidebook.'", February 2012.

   [IABCOMM2]
              Internet Architecture Board, "Response to ICANN questions
              concerning 'The interpretation of rules in the ICANN gTLD
              Applicant Guidebook'", March 2012.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5893, August 2010.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, August 2010.

   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
              Internationalized Domain Names in Applications (IDNA)
              2008", RFC 5895, September 2010.

   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
              Internationalization in the IETF", BCP 166, RFC 6365,
              September 2011.

   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
              Considerations", Unicode Technical Report #36, July 2012.

   [WCAG20]   "Web Content Accessibility Guidelines (WCAG) 2.0",
              December 2008.

Authors' Addresses

   Andrew Sullivan
   Dyn, Inc.
   150 Dow St
   Manchester, NH  03101
   U.S.A.

   Email: asullivan@dyn.com


   Dave Thaler
   Microsoft
   One Microsoft Way
   Redmond, WA  98052
   U.S.A.

   Email: dthaler@microsoft.com


   John C Klensin
   1770 Massachusetts Ave, Ste 322
   Cambridge, MA  02140
   USA

   Phone: +1 617 491 5735
   Email: john-ietf@jck.com


   Olaf Kolkman
   NLnet Labs
   Science Park 400
   Amsterdam  1098 XH
   The Netherlands

   Email: olaf@NLnetLabs.nl