Network Working Group                                        A. Sullivan
Internet-Draft                                                 Dyn, Inc.
Intended status: Informational                                 D. Thaler
Expires: December 7, 2012                                      Microsoft
                                                              O. Kolkman
                                                              NLnet Labs
                                                            June 5, 2012

 Principles for Unicode Code Point Inclusion in Labels in the DNS Root
               draft-sullivan-dns-zone-codepoint-pples-00

Abstract

Traditionally, the management of the DNS root zone permitted only
"alphabetic" labels. As long as the root zone included only ASCII
characters, and as long as there was only one form of a label, the
restriction plainly meant that only the letters A-Z and a-z were
permitted. The advent of internationalized labels using IDNA2008
presents some complications for the restriction.
One of the complications is the meaning of the term "alphabetic" when
applied to the Unicode code points in U-labels. This memo presents a
set of principles that can be used to determine whether a Unicode code
point may be wisely included in the repertoire of permissible code
points in a U-label in a zone.

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF). Note that other groups may also distribute working
documents as Internet-Drafts. The list of current Internet-Drafts is
at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 7, 2012.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1. Background and Introduction
  1.1. Terminology
2. Conservatism Principle
3. Inclusion Principle
4. Simplicity Principle
5. Predictability Principle
6. Stability Principle
7. Letter Principle
8. Conclusion
9. Security Considerations
10. IANA Considerations
11. Acknowledgements
12. Informative References
Authors' Addresses

1. Background and Introduction

In recent communications ([IABCOMM1] and [IABCOMM2]), the IAB has
emphasized the importance of conservatism in allocating labels
conforming to IDNA2008 ([RFC5890], [RFC5891], [RFC5892], [RFC5893],
[RFC5894], [RFC5895]) inside the root zone. Traditional LDH-labels
(see [RFC5890] for definitions of IDNA terms) in the root zone used
only alphabetic characters (i.e., ASCII a-z or A-Z). Matters are
more complicated with U-labels, however. The IAB communications
recommended that U-labels permit only code points with a
General_Category (gc) of Ll (Lowercase_Letter), Lo (Other_Letter), or
Lm (Modifier_Letter), but noted that for practical considerations
other code points might be permitted on a case-by-case basis. In
what follows we will use the Unicode notation; e.g., gc=Ll.

The IAB recommendation does, however, present some problems that need
to be addressed. First, it is by no means clear that all of the code
points that have gc=Lo or gc=Lm and are permitted under IDNA2008 are
appropriate for the root zone. To take but one example, the code
point U+02BC MODIFIER LETTER APOSTROPHE has gc=Lm.
In practically every rendering (we are unaware of an exception),
U+02BC is indistinguishable from U+2019 RIGHT SINGLE QUOTATION MARK,
which has gc=Pf (Final_Punctuation). U+02BC will also be read by
large numbers of people as being the same character as U+0027
APOSTROPHE, which has gc=Po (Other_Punctuation). U+02BC is PROTOCOL
VALID (PVALID) under IDNA2008 (see [RFC5892]), whereas both other
code points are DISALLOWED. So, to begin with, it is plain that not
every code point with gc in {Ll, Lo, Lm} is consistent with any
conservatism principle.

To make matters worse, some languages are dependent on code points
with gc=Mc (Spacing_Mark) or gc=Mn (Nonspacing_Mark). This
dependency is particularly common in Indic languages, though not
exclusive to them. (At the risk of vastly oversimplifying, the
overarching issue is mostly the interaction of complex writing
systems and the way Unicode works.) To restrict users of those
languages only to code points with gc in {Ll, Lo, Lm} would be
extremely limiting. While DNS labels are not words, or sentences, or
phrases (as noted in [RFC4690]), they are intended as useful
mnemonics. Mnemonics that diverge wildly from the usual conventions
in a language are likely to attract strong objections, particularly
in the root. The objections might drag the discussion away from
sound management of the shared DNS root zone and towards discussions
of cultural hegemony. That sort of discussion itself might present
risks for the operation of the root zone.

For reasons of sound management, it is not desirable to decide
whether to permit a given code point only when an application
containing that code point is pending. That approach reduces
predictability and is bound to appear subject to special pleas. It
is better instead to come up with a set of principles for guiding
decisions about code points.
These principles can then function as meta-rules, determining the
rules for inclusion of any code point (from those permitted by IDNA)
in labels in the root. The principles might also be adopted by other
zones that are shared by much of the Internet. Such a set of
principles follows in the sections below. Each section includes
remarks on the extent to which the principle could be wisely adopted
by zones other than the root.

1.1. Terminology

Terms relevant to IDNA2008 can be found in [RFC5890]. Other relevant
internationalization terms are defined in [RFC6365].

This memo does not propose a protocol standard, and the use of words
like "should" follows the ordinary English meaning, and not that laid
out in [RFC2119].

2. Conservatism Principle

The root zone is, by definition, the one DNS zone that must be shared
by everybody. Therefore, any decision to permit a code point in the
root zone should be as conservative as practicable. Doubts should
always be resolved in favor of rejecting a code point for inclusion
rather than in favor of including it, in order to minimize risk.

This principle is easily (and wisely) adoptable by any zone. It is
also the one that is most likely to yield the safest result.

3. Inclusion Principle

Just as IDNA2008 starts from the principle that the Unicode range is
excluded, and then adds code points according to derived properties
of the code points, so the root zone should only permit inclusion of
a code point if it is known to be safe. The default treatment of a
code point should be that it is excluded.

This principle is easily (and wisely) adoptable by any zone.

4. Simplicity Principle

The rules for determining whether a code point is to be included
should be simple enough that they are readily understood by someone
with a moderate background in the DNS and Unicode issues.
This principle does not mean that a completely naive person needs to
be able to understand the rationale for why a code point is included,
but it does mean that very peculiar code points, even if they are
safe in themselves, will be rejected because the reason for including
them would be too difficult to understand.

The meaning of "simple" or "readily understood" is context dependent.
For instance, the root zone has to serve everyone in the world; for
practical purposes, this means that the reasons for including a code
point need to be comprehensible even to people who cannot use the
script where the code point is found. In a zone that permits a very
small subset of Unicode characters (for instance, only those needed
to write a single language) and that supports a clearly-delineated
linguistic community (for instance, the speakers of a single language
with well-understood written conventions), more complicated rules
might be acceptable.

5. Predictability Principle

The rules for determining whether a code point is to be included
should be predictable enough that those with the requisite
understanding of DNS, IDNA, and Unicode would all generally reach the
same conclusion. This is not a requirement for algorithmic treatment
of code points (the difficulties with the Unicode Letter and Mark
categories illustrate why that would be too difficult). It is rather
to say that the consistent application of professional judgment is
likely to yield the same results; combined with the principle in
Section 2, when results are not predictable, the anomalous code point
would not be included.

Just as in Section 4, this principle is not easily extended to zones
lower than the root because what is predictable within a given
language community is possibly very surprising across languages.

6. Stability Principle

Once a code point is permitted, it is at least very hard to stop
permitting that code point. In general, the list of code points to
be permitted should change very slowly, if at all, and usually only
in the direction of permitting an addition as time and experience
indicate that inclusion of such a code point is both safe and
consistent with these principles.

This principle likely extends to every delegation-centric domain: if
one delegation is permitted to use a code point, it is very hard to
see why others might not.

7. Letter Principle

In keeping with the spirit of the note in [RFC1123] that top-level
labels "will be alphabetic", the rules should not include code points
that are not normally used to write words, or that are in some cases
normally used for purposes other than writing words. This is not the
same as using Unicode's General_Category to include only letters.
Rather, the principle expands the possible class of included code
points beyond the Unicode letters, but only so far as to include the
things that are normally used the way letters are. Under this
principle, code points with (for example) gc=Mn might be included --
but only those that are used to write words and not (for instance)
musical symbols. This principle should be applied as narrowly as
possible; as [RFC4690] says, "While DNS labels may conveniently be
used to express words in many circumstances, the goal is not to
express words (or sentences or phrases), but to permit the creation
of unambiguous labels with good mnemonic value."

Because the root zone must be shared by everyone, this principle is
more important in it than in zones that are intended for use by
clearly-defined linguistic communities.

8. Conclusion

The foregoing principles could be applied generally when considering
any range of Unicode code points for possible inclusion in the root
zone. It is worth observing that doing anything (especially in light
of Section 6) implicitly disadvantages communities with a writing
system not yet well understood and not represented in the technical
and policy communities involved in the discussion. That disadvantage
is to be guarded against as much as practical, but is effectively
impossible to prevent (while still taking action) in light of
imperfect human knowledge.

9. Security Considerations

The principles outlined in this memo are partly intended to reduce
the possibility of confusion among different labels. While these
principles may contribute to reduction of risk, they are not
sufficient to provide a comprehensive internationalization policy for
zone management.

10. IANA Considerations

None. RFC Editor: this section may be removed on publication.

11. Acknowledgements

The authors thank the participants in the IAB Internationalization
programme for the discussion of the ideas in this memo.

12. Informative References

[IABCOMM1]
           Internet Architecture Board, "IAB Statement: 'The
           interpretation of rules in the ICANN gTLD Applicant
           Guidebook.'", February 2012.

[IABCOMM2]
           Internet Architecture Board, "Response to ICANN questions
           concerning 'The interpretation of rules in the ICANN gTLD
           Applicant Guidebook'", March 2012.

[RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
           and Support", STD 3, RFC 1123, October 1989.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
           Recommendations for Internationalized Domain Names
           (IDNs)", RFC 4690, September 2006.

[RFC5890]  Klensin, J., "Internationalized Domain Names for
           Applications (IDNA): Definitions and Document Framework",
           RFC 5890, August 2010.

[RFC5891]  Klensin, J., "Internationalized Domain Names in
           Applications (IDNA): Protocol", RFC 5891, August 2010.

[RFC5892]  Faltstrom, P., "The Unicode Code Points and
           Internationalized Domain Names for Applications (IDNA)",
           RFC 5892, August 2010.

[RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
           Internationalized Domain Names for Applications (IDNA)",
           RFC 5893, August 2010.

[RFC5894]  Klensin, J., "Internationalized Domain Names for
           Applications (IDNA): Background, Explanation, and
           Rationale", RFC 5894, August 2010.

[RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
           Internationalized Domain Names in Applications (IDNA)
           2008", RFC 5895, September 2010.

[RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
           Internationalization in the IETF", BCP 166, RFC 6365,
           September 2011.

Authors' Addresses

Andrew Sullivan
Dyn, Inc.
150 Dow St
Manchester, NH 03101
U.S.A.

Email: asullivan@dyn.com

Dave Thaler
Microsoft
One Microsoft Way
Redmond, WA 98052
U.S.A.

Email: dthaler@microsoft.com

Olaf Kolkman
NLnet Labs
Science Park 400
Amsterdam 1098 XH
The Netherlands

Email: olaf@NLnetLabs.nl