
2	Network Working Group                                        A. Sullivan
3	Internet-Draft                                                 Dyn, Inc.
4	Intended status: Informational                                 D. Thaler
5	Expires: August 3, 2013                                        Microsoft
6	                                                              J. Klensin

8	                                                              O. Kolkman
9	                                                              NLnet Labs
10	                                                        January 30, 2013

12	    Principles for Unicode Code Point Inclusion in Labels in the DNS
13	                 draft-iab-dns-zone-codepoint-pples-02

15	Abstract

17	   IDNA makes available to DNS zone administrators a very wide range of
18	   Unicode code points.  Most operators of zones should probably not
19	   permit registration of U-labels using the entire range.  This is
20	   especially true of zones that accept registrations across
21	   organizational boundaries, such as top-level domains and, most
22	   importantly, the root.  It is unfortunately not possible to generate
23	   algorithms to determine whether permitting a code point presents a
24	   low risk.  This memo presents a set of principles that can be used to
25	   guide the decision of whether a Unicode code point may be wisely
26	   included in the repertoire of permissible code points in a U-label in
27	   a zone.

29	Status of this Memo

31	   This Internet-Draft is submitted in full conformance with the
32	   provisions of BCP 78 and BCP 79.

34	   Internet-Drafts are working documents of the Internet Engineering
35	   Task Force (IETF).  Note that other groups may also distribute
36	   working documents as Internet-Drafts.  The list of current Internet-
37	   Drafts is at http://datatracker.ietf.org/drafts/current/.

39	   Internet-Drafts are draft documents valid for a maximum of six months
40	   and may be updated, replaced, or obsoleted by other documents at any
41	   time.  It is inappropriate to use Internet-Drafts as reference
42	   material or to cite them other than as "work in progress."

44	   This Internet-Draft will expire on August 3, 2013.

46	Copyright Notice

48	   Copyright (c) 2013 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (http://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the Simplified BSD License.

61	Table of Contents

63	   1.  Introduction
64	     1.1.  Terminology
65	   2.  Background
66	     2.1.  More-Restrictive Rules Going Up the DNS Tree
67	   3.  Principles Applicable to All Zones
68	     3.1.  Longevity Principle
69	     3.2.  Least Astonishment Principle
70	     3.3.  Contextual Safety Principle
71	   4.  Principles Applicable to All Public Zones
72	     4.1.  Conservatism Principle
73	     4.2.  Inclusion Principle
74	     4.3.  Simplicity Principle
75	     4.4.  Predictability Principle
76	     4.5.  Stability Principle
77	   5.  Principle Specific to the Root Zone
78	     5.1.  Letter Principle
79	   6.  Confusion and Context
80	   7.  Conclusion
81	   8.  Security Considerations
82	   9.  IANA Considerations
83	   10. Acknowledgements
84	   11. IAB Members at the Time of This Writing
85	   12. Informative References
86	   Authors' Addresses

88	1.  Introduction

90	   Operators of a DNS zone need to set policies around what Unicode code
91	   points are allowed in labels in that zone.  Typically there are a
92	   number of important goals to consider when constructing such
93	   policies.  These include, for instance, avoiding possible visual
94	   confusability between two labels, avoiding possible confusion between
95	   Fully-Qualified Domain Names (FQDNs) and IP address literals,
96	   accessibility to the disabled (see [WCAG20] for some discussion in a
97	   web context), and other usability issues.

99	   This document provides a set of principles that zone operators can
100	   use to construct their code point policies in order to improve
101	   usability and clarity and thereby reduce confusion.

103	1.1.  Terminology

105	   This document uses the following terms.

107	      A-label: an LDH label that starts with "xn--" and meets all the
108	      IDNA requirements, with additional restrictions as explained in
109	      Section 2.3.2.1 of [RFC5890].

111	      Character: a member of a set of elements used for the
112	      organization, control, or representation of data.  See Section 2
113	      of [RFC6365] for more details.

115	      Language: a way that humans communicate.  The use of language
116	      occurs in many forms, the most common of which are speech,
117	      writing, and signing.  See Section 2 of [RFC6365] for more
118	      details.

120	      LDH Label: a string consisting of ASCII letters, digits, and the
121	      hyphen, with additional restrictions as explained in Section 2.3.1
122	      of [RFC5890].

124	      Public zone: in this document, a DNS zone that accepts
125	      registration requests from organizations outside the zone
126	      administrator's own organization.  (Whether the zone performs
127	      delegation is a separate question.  What is important is the
128	      diversity of the registration-requesting community.)  Note that
129	      under this definition, the root zone is a public zone, though one
130	      that has a unique function in the DNS.

132	      Rendering: the display of a string of text.  See Section 5 of
133	      [RFC6365] for more details.

135	      Script: a set of graphic characters used for the written form of
136	      one or more languages.  See Section 2 of [RFC6365] for more
137	      details.

139	      U-label: a string of Unicode characters that meets all the IDNA
140	      requirements and includes at least one non-ASCII character, with
141	      additional restrictions as explained in Section 2.3.2.1 of
142	      [RFC5890].

144	      Writing system: a set of rules for using one or more scripts to
145	      write a particular language.  See Section 2 of [RFC6365] for more
146	      details.

148	   This memo does not propose a protocol standard, and the use of words
149	   such as "should" follows the ordinary English meaning, and not that
150	   laid out in [RFC2119].
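
   As a rough illustration of the A-label and U-label definitions
   above, the sketch below converts a single label between the two
   forms using the Punycode codec in the Python standard library.  The
   function names are assumptions made for this example.  The code
   performs none of the IDNA2008 validity checks described in
   [RFC5891]; it only shows how the two forms relate to each other.

```python
ACE_PREFIX = "xn--"  # prefix that marks an A-label


def to_a_label(u_label: str) -> str:
    """Encode a candidate U-label as an A-label (no IDNA validation)."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an A-label back to its Unicode form (no IDNA validation)."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: missing 'xn--' prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS.
print(to_a_label("b\u00fccher"))
print(to_u_label("xn--bcher-kva"))
```

   A production registry would use a full IDNA2008 implementation
   rather than the bare codec, since the codec accepts strings that
   IDNA2008 disallows.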

152	2.  Background

154	   In recent communications ([IABCOMM1] and [IABCOMM2]), the IAB has
155	   emphasized the importance of conservatism in allocating labels
156	   conforming to IDNA2008 ([RFC5890], [RFC5891], [RFC5892], [RFC5893],
157	   [RFC5894], [RFC5895]) in DNS zones, and especially in the root zone.
158	   Traditional LDH labels in the root zone used only alphabetic
159	   characters (i.e., ASCII a-z or A-Z).  Matters are more complicated
160	   with U-labels, however.  The IAB communications recommended that
161	   U-labels permit only code points with a General_Category (gc) of Ll
162	   (Lowercase_Letter), Lo (Other_Letter), or Lm (Modifier_Letter), but
163	   noted that for practical considerations other code points might be
164	   permitted on a case-by-case basis.

166	   The IAB recommendations do, however, leave some issues open that need
167	   to be addressed.  First, it is by no means clear that all of the code
168	   points that have General_Category Lo or Lm and are permitted under
169	   IDNA2008 are appropriate for a zone such as the root zone.  To take
170	   but one example, the code point U+02BC MODIFIER LETTER APOSTROPHE has
171	   a General_Category of Lm.  In practically every rendering (and we are
172	   unaware of an exception), U+02BC is indistinguishable from U+2019
173	   RIGHT SINGLE QUOTATION MARK, which has a General_Category of Pf
174	   (Final_Punctuation).  U+02BC will also be read by large numbers of
175	   people as being the same character as U+0027 APOSTROPHE, which has a
176	   General_Category of Po (Other_Punctuation), and some computer systems
177	   may treat U+02BC as U+0027.  U+02BC is PROTOCOL VALID (PVALID) under
178	   IDNA2008 (see [RFC5892]), whereas both other code points are
179	   DISALLOWED.  So, to begin with, it is plain that not every code point
180	   with a General_Category of Ll, Lo, or Lm is consistent with the type
181	   of conservatism principle discussed in Section 4.1 or the IAB
182	   recommendation.
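
   The General_Category values cited above can be checked with the
   unicodedata module in the Python standard library.  The sketch
   below is only an illustration: the filter function is a
   hypothetical name, and the results reflect whichever Unicode
   version the Python build ships with.  It shows that a filter based
   on Ll, Lo, and Lm alone admits U+02BC while rejecting its two
   lookalikes.

```python
import unicodedata

# Categories named in the IAB recommendation discussed above.
LETTER_CATEGORIES = {"Ll", "Lo", "Lm"}


def passes_letter_filter(ch: str) -> bool:
    """True if the code point's General_Category is Ll, Lo, or Lm."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


for ch in ("\u02bc", "\u2019", "\u0027"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"gc={unicodedata.category(ch)}, "
          f"passes={passes_letter_filter(ch)}")
```

   The category test says nothing about how a code point is rendered,
   which is why it cannot settle the question on its own.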

184	   To make matters worse, some languages are dependent on code points
185	   with General_Category Mc (Spacing_Mark) or General_Category Mn
186	   (Nonspacing_Mark).  This dependency is particularly common in Indic
187	   languages, though not exclusive to them.  (At the risk of vastly
188	   oversimplifying, the overarching issue is mostly the interaction of
189	   complex writing systems and the way Unicode works.)  To restrict
190	   users of those languages only to code points with General_Category of
191	   Ll, Lo, or Lm would be extremely limiting.  While DNS labels are not
192	   words, or sentences, or phrases (as noted in [RFC4690]), they are
193	   intended to support useful mnemonics.  Mnemonics that diverge wildly
194	   from the usual conventions are poor ones, because in not following
195	   the usual conventions they are not easy to remember.  Also, wide
196	   divergence from usual conventions, if not well-justified (and
197	   especially in a shared namespace like the root) invites political
198	   controversy.
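
   To make the dependency on marks concrete, the following sketch
   lists the code points in one Devanagari word that a letters-only
   filter would reject.  The word chosen (the name of the Hindi
   language written in Devanagari) and the function name are
   assumptions made for this example.

```python
import unicodedata

LETTER_CATEGORIES = {"Ll", "Lo", "Lm"}


def rejected_code_points(label: str) -> list[str]:
    """List the code points that a letters-only filter would reject."""
    return [f"U+{ord(ch):04X} ({unicodedata.category(ch)})"
            for ch in label
            if unicodedata.category(ch) not in LETTER_CATEGORIES]


# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(rejected_code_points(HINDI))
```

   Half of the code points in this ordinary word are spacing or
   nonspacing marks, so a letters-only rule would make the word
   unregistrable.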

200	   Many of the issues above turn out to be relevant to all public zones.
201	   Moreover, the overall issue of developing a policy for code point
202	   permission is common to all zones that accept A-labels or U-labels
203	   for registration.  As Section 4.2.4 of [RFC5891] says, every registry
204	   at every level of the DNS is "expected to establish policies about
205	   label registrations."

207	   For reasons of sound management, it is not desirable to decide
208	   whether to permit a given code point only when an application
209	   containing that code point is pending.  That approach reduces
210	   predictability and is bound to appear subject to special pleas.  It
211	   is better instead to come up with the rules governing acceptance of
212	   code points in advance.

214	   As is evident from the foregoing discussion about the Letter and Mark
215	   categories, it is simply not possible to make code point decisions
216	   algorithmically.  If it were possible to develop such an algorithm,
217	   it would already exist: the DNS is hardly unique in needing to impose
218	   restrictions on code points while accommodating many different
219	   linguistic communities.  Nevertheless, new guidelines can be made by
220	   starting from overarching principles.  These guidelines act more as
221	   meta-rules, leading to the establishment of other rules about the
222	   inclusion and exclusion of particular code points in labels in a
223	   given zone, always based on the list of code points permitted by
224	   IDNA.

226	2.1.  More-Restrictive Rules Going Up the DNS Tree

228	   A set of principles derived from the above ideas follows in Section 3
229	   through Section 5 below.  Such principles fall into three categories.

231	   Some principles apply to every DNS zone.  Some additional principles
232	   apply to all public zones, including the root zone.  Finally, other
233	   principles apply only to the root zone.  This means that zones higher
234	   in the DNS tree tend to have more restrictive rules (since additional
235	   principles apply), and zones lower in the DNS tree tend to have less
236	   restrictive rules, since they are used within a narrower context.
237	   In general, the relevant context for a principle is that of the zone,
238	   not that of a given subset of the user community; for the root zone,
239	   for example, the context is "the entire Internet population".
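
   The cumulative structure of the three categories can be sketched as
   a small lookup.  The principle names come from Sections 3 through 5
   of this memo; the zone-kind labels ("non-public", "public", "root")
   and the function name are assumptions made for this example.

```python
# Principle names are taken from Sections 3, 4, and 5 of this memo.
ALL_ZONES = ("Longevity", "Least Astonishment", "Contextual Safety")
PUBLIC_ZONES = ("Conservatism", "Inclusion", "Simplicity",
                "Predictability", "Stability")
ROOT_ZONE = ("Letter",)


def principles_for(zone_kind: str) -> tuple:
    """Return the principles that apply to a zone of the given kind.

    The zone_kind values are labels invented for this sketch.
    """
    if zone_kind == "non-public":
        return ALL_ZONES
    if zone_kind == "public":
        return ALL_ZONES + PUBLIC_ZONES
    if zone_kind == "root":
        return ALL_ZONES + PUBLIC_ZONES + ROOT_ZONE
    raise ValueError(f"unknown zone kind: {zone_kind!r}")


for kind in ("non-public", "public", "root"):
    print(kind, len(principles_for(kind)))
```

   Each tier only adds principles, which is why the rules become more
   restrictive toward the root.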

241	3.  Principles Applicable to All Zones

243	3.1.  Longevity Principle

245	   Unicode properties of a code point ought to be stable across the
246	   versions of Unicode that users of the zone are likely to have
247	   installed.  Because it is possible for the properties of a code point
248	   to change between Unicode versions, a good way to predict such
249	   stability is to ensure that a code point has in fact been stable for
250	   multiple successive versions of Unicode.  This principle is related
251	   to the Stability Principle in Section 4.5.

253	   The more diverse the community using the zone, the greater the
254	   importance of following this principle.  The policy for a leaf zone
255	   in the DNS might only require stability across two Unicode versions,
256	   whereas a more public zone might require stability across four or
257	   more releases before the code point's properties are considered long-
258	   lived and stable.
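
   One way to approximate such a stability check is sketched below.
   It compares the General_Category of a code point across several
   Unicode database snapshots.  The Python standard library ships only
   two snapshots (Unicode 3.2.0 and the version bundled with the
   interpreter), so this example is far coarser than the two-release
   or four-release policies mentioned above; the function name and the
   choice of General_Category as the property to compare are
   assumptions.

```python
import unicodedata


def stable_general_category(ch: str, databases) -> bool:
    """True if every snapshot reports the same assigned category."""
    categories = {db.category(ch) for db in databases}
    # "Cn" means the code point was unassigned in some snapshot.
    return len(categories) == 1 and "Cn" not in categories


# Only two snapshots are available without external data files.
SNAPSHOTS = [unicodedata.ucd_3_2_0, unicodedata]

print(stable_general_category("a", SNAPSHOTS))
# U+1E9E LATIN CAPITAL LETTER SHARP S was added after Unicode 3.2.0.
print(stable_general_category("\u1e9e", SNAPSHOTS))
```

   A real policy would load the data files for each successive Unicode
   release and would probably compare more properties than
   General_Category.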

260	3.2.  Least Astonishment Principle

262	   Every zone administrator should be sensitive to the likely use of a
263	   code point to be permitted, particularly taking into account the
264	   population likely to use the zone.  Zone administrators should
265	   especially consider whether a candidate code point could present
266	   difficulty if the code point is encountered outside the usual
267	   linguistic circumstances.  By the same token, the failure to support
268	   a code point that is normal in some linguistic circumstances could be
269	   very surprising for users likely to encounter the names in those
270	   circumstances.

272	3.3.  Contextual Safety Principle

274	   Every zone administrator should be sensitive to ways in which a code
275	   point that is permitted could be used in support of malicious
276	   activity.  This is not a completely new problem: the digit 1 and the
277	   lower-case letter l are, for instance, easily confused in many
278	   contexts.  The very large repertoire of code points in Unicode (even
279	   just the subset permitted for IDNs) makes the problem somewhat worse,
280	   just because of the scale.

282	4.  Principles Applicable to All Public Zones

284	4.1.  Conservatism Principle

286	   Public zones are, by definition, zones that are shared by different
287	   groups of people.  Therefore, any decision to permit a code point in
288	   a public zone (including the root) should be as conservative as
289	   practicable.  Doubts should always be resolved in favor of rejecting
290	   a code point for inclusion rather than in favor of including it, in
291	   order to minimize risk.

293	4.2.  Inclusion Principle

295	   Just as IDNA2008 starts from the principle that the entire Unicode
296	   range is excluded, and then adds code points according to derived
297	   properties of the code points, so a public zone should only permit
298	   inclusion of a code point if it is known to be "safe" in terms of
299	   usability and confusability within the context of that zone.  The
300	   default treatment of a code point should be that it is excluded.
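
   In implementation terms, the Inclusion Principle amounts to a
   default-deny check against an explicit repertoire.  The sketch
   below is a minimal illustration; the repertoire shown (ASCII
   lowercase letters plus three umlauted vowels) is purely
   hypothetical and is not a recommendation for any zone.

```python
def label_permitted(label: str, repertoire: frozenset) -> bool:
    """Default deny: every code point must be explicitly listed."""
    return len(label) > 0 and all(ch in repertoire for ch in label)


# Hypothetical repertoire for illustration only.
EXAMPLE_REPERTOIRE = frozenset("abcdefghijklmnopqrstuvwxyz"
                               "\u00e4\u00f6\u00fc")

print(label_permitted("b\u00fccher", EXAMPLE_REPERTOIRE))
print(label_permitted("caf\u00e9", EXAMPLE_REPERTOIRE))
```

   Nothing is permitted merely because it was never considered: a code
   point that is absent from the repertoire is rejected.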

302	4.3.  Simplicity Principle

304	   The rules for determining whether a code point is to be included
305	   should be simple enough that they are readily understood by someone
306	   with a moderate background in the DNS and Unicode issues.  This
307	   principle does not mean that a completely naive person needs to be
308	   able to understand why a code point is included.  It does mean
309	   that, when the reason for including a very peculiar code point is
310	   too difficult to understand, the code point will be rejected,
311	   even if it is safe in itself.

314	   The meaning of "simple" or "readily understood" is context-dependent.
315	   For instance, the root zone has to serve everyone in the world; for
316	   practical purposes, this means that the reasons for including a code
317	   point need to be comprehensible even to people who cannot use the
318	   script where the code point is found.  In a zone that permits a
319	   constrained subset of Unicode characters (for instance, only those
320	   needed to write a single alphabetic language) and that supports a
321	   clearly-delineated linguistic community (for instance, the speakers
322	   of a single language with well-understood written conventions), more
323	   complicated rules might be acceptable.  Compare this principle with
324	   the Least Astonishment Principle in Section 3.2.

326	4.4.  Predictability Principle

328	   The rules for determining whether a code point is to be included
329	   should be predictable enough that those with the requisite
330	   understanding of DNS, IDNA, and Unicode will usually reach the same
331	   conclusion.  This is not a requirement for algorithmic treatment of
332	   code points; as previously noted, that is not possible.  It is rather
333	   to say that the consistent application of professional judgment is
334	   likely to yield the same results; combined with the principle in
335	   Section 4.1, when results are not predictable the anomalous code
336	   point would not be permitted.

338	   Just as in Section 4.3, this principle tends to cause more
339	   restriction the more diverse the community using the zone; it is most
340	   restrictive for the root zone.  This is because what is predictable
341	   within a given language community is possibly very surprising across
342	   languages.

344	4.5.  Stability Principle

346	   Once a code point is permitted, it is at least very hard to stop
347	   permitting that code point.  In public zones (including the root),
348	   the list of code points to be permitted should change very slowly, if
349	   at all, and usually only in the direction of permitting an addition
350	   as time and experience indicate that inclusion of such a code point
351	   is both safe and consistent with these principles.
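
   A registry could enforce the "additions only" direction of change
   mechanically when a new repertoire is proposed.  The sketch below
   assumes repertoires are held as sets of code points; the function
   names are hypothetical.

```python
def is_stable_update(current: frozenset, proposed: frozenset) -> bool:
    """True if the proposed repertoire removes nothing from the current one."""
    return current <= proposed


def added_code_points(current: frozenset, proposed: frozenset) -> list:
    """Code points that the proposal would newly permit, in order."""
    return sorted(proposed - current)


current = frozenset("abc")
print(is_stable_update(current, frozenset("abcd")))
print(is_stable_update(current, frozenset("ab")))
print(added_code_points(current, frozenset("abcd")))
```

   The check covers only the direction of change.  Whether each added
   code point is safe remains a matter for the other principles.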

353	5.  Principle Specific to the Root Zone

355	5.1.  Letter Principle

357	   There is a note in [RFC1123] that top-level labels "will be
358	   alphabetic".  In the absence of widespread agreement about the force
359	   of that note, prudence suggests that U-labels in the root zone should
360	   exclude code points that are not normally used to write words, or
361	   that are in some cases normally used for purposes other than writing
362	   words.  This is not the same as using Unicode's General_Category to
363	   include only letters.  It is a restriction that expands the possible
364	   class of included code points beyond the Unicode letters, but only
365	   expands so far as to include the things that are normally used to
366	   create words.  Under this principle, code points with (for example)
367	   General_Category Mn (Nonspacing_Mark) might be included -- but only
368	   those that are used to write words and not (for instance) musical
369	   symbols.  In addition, such marks should only be used within a label
370	   in ways that they would be used when making a word: combinations that
371	   would be nonsense when used in a word should also be rejected when
372	   tried in DNS labels.  This principle should be applied as narrowly as
373	   possible; as [RFC4690] says, "While DNS labels may conveniently be
374	   used to express words in many circumstances, the goal is not to
375	   express words (or sentences or phrases), but to permit the creation
376	   of unambiguous labels with good mnemonic value."
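
   A very rough structural check in the spirit of this principle is
   sketched below.  It accepts letters, accepts combining marks only
   when they follow a letter or another mark, and excludes the Musical
   Symbols block (U+1D100 through U+1D1FF).  The function names and
   the single excluded range are assumptions for illustration.  The
   check is necessary at best and nowhere near sufficient, since
   deciding whether a combination would be nonsense in a word requires
   knowledge of the writing system.

```python
import unicodedata

LETTERS = {"Ll", "Lo", "Lm"}
MARKS = {"Mn", "Mc"}
# Illustrative exclusion: the Unicode Musical Symbols block.
EXCLUDED_RANGES = [(0x1D100, 0x1D1FF)]


def in_excluded_range(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in EXCLUDED_RANGES)


def plausible_word_label(label: str) -> bool:
    """Letters, plus marks that have a preceding letter to attach to."""
    has_base = False  # becomes True once a letter has been seen
    for ch in label:
        if in_excluded_range(ch):
            return False
        gc = unicodedata.category(ch)
        if gc in LETTERS:
            has_base = True
        elif gc in MARKS:
            if not has_base:
                return False  # a mark with nothing to attach to
        else:
            return False  # digits, punctuation, symbols, and so on
    return len(label) > 0


print(plausible_word_label("e\u0301"))        # letter, then combining acute
print(plausible_word_label("\u0301e"))        # leading combining mark
print(plausible_word_label("a\U0001D167"))    # musical combining mark
```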

378	6.  Confusion and Context

380	   While many discussions of confusion have focused on characters, e.g.,
381	   whether two characters are confusable with each other (and under what
382	   circumstances), a focus on characters alone could lead to the
383	   prohibition of very large numbers of labels, including many that
384	   present little risk.  Instead, the focus should be on whether one
385	   label is confusable with another.  For example, if a label contains
386	   several characters that are distinct to a particular script, and all
387	   of its characters are from that script, it is inherently not
388	   confusable with a label from any other script no matter what other
389	   characters might appear in it.  Another label that lacks those
390	   distinguishing characters might be a problem.  The notion extends
391	   from labels to domain names, in the sense that distinguishing
392	   characters used in a higher-level label may set expectations with
393	   respect to the characters in the lower level labels.  This
394	   expectation might be regarded as a benefit, but it is also a problem,
395	   since there is no technical way to require consistent policies in
396	   delegated name spaces.
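
   The difference between character-level and label-level analysis can
   be sketched as follows.  The table of Cyrillic-to-Latin lookalikes
   is a small hand-picked sample for illustration, not the Unicode
   confusables data, and the function names are assumptions.  A label
   is flagged only when every one of its characters has a lookalike;
   a single distinguishing character is enough to clear it.

```python
# Small illustrative sample: Cyrillic letters that resemble Latin ones.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def distinguishing_characters(label: str, lookalikes: dict) -> list:
    """Characters in the label that have no cross-script lookalike."""
    return [ch for ch in label if ch not in lookalikes]


def wholly_confusable(label: str, lookalikes: dict) -> bool:
    """True if no character in the label distinguishes its script."""
    return len(label) > 0 and not distinguishing_characters(label, lookalikes)


print(wholly_confusable("\u0441\u043e\u0440", LOOKALIKES))
print(wholly_confusable("\u043f\u043e\u0447\u0442\u0430", LOOKALIKES))
```

   A character-level rule would have to ban all five sample letters
   and, with them, the second label, even though its other letters
   make it visibly Cyrillic.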

398	7.  Conclusion

400	   The principles outlined in this document can be applied when
401	   considering any range of Unicode code points for possible inclusion
402	   in a DNS zone.  It is worth observing that doing anything (especially
403	   in light of Section 4.5) implicitly disadvantages communities with a
404	   writing system not yet well understood and not represented in the
405	   technical and policy communities involved in the discussion.  That
406	   disadvantage is to be guarded against as much as practical, but is
407	   effectively impossible to prevent (while still taking action) in
408	   light of imperfect human knowledge.

410	8.  Security Considerations

412	   The principles outlined in this memo are intended to improve
413	   usability and clarity and thereby reduce confusion among different
414	   labels.  While these principles may contribute to reduction of risk,
415	   they are not sufficient to provide a comprehensive
416	   internationalization policy for zone management.

418	   Additional discussion of Unicode security considerations can be found
419	   in [UTR36].

421	9.  IANA Considerations

423	   None.  RFC Editor: this section may be removed on publication.

425	10.  Acknowledgements

427	   The authors thank the participants in the IAB Internationalization
428	   program for the discussion of the ideas in this memo, particularly
429	   Marc Blanchet.  In addition, Stephane Bortzmeyer, Paul Hoffman,
430	   Daniel Kalchev, Panagiotis Papaspiliopoulos, and Vaggelis Segredakis
431	   made specific comments.

433	11.  IAB Members at the Time of This Writing

435	   Bernard Aboba
436	   Jari Arkko
437	   Marc Blanchet
438	   Ross Callon
439	   Alissa Cooper
440	   Spencer Dawkins
441	   Joel Halpern
442	   Russ Housley
443	   David Kessens
444	   Danny McPherson
445	   Jon Peterson
446	   Dave Thaler
447	   Hannes Tschofenig

449	12.  Informative References

451	   [IABCOMM1]
452	              Internet Architecture Board, "IAB Statement: 'The
453	              interpretation of rules in the ICANN gTLD Applicant
454	              Guidebook.'", February 2012.

456	   [IABCOMM2]
457	              Internet Architecture Board, "Response to ICANN questions
458	              concerning 'The interpretation of rules in the ICANN gTLD
459	              Applicant Guidebook'", March 2012.

461	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
462	              and Support", STD 3, RFC 1123, October 1989.

464	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
465	              Requirement Levels", BCP 14, RFC 2119, March 1997.

467	   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
468	              Recommendations for Internationalized Domain Names
469	              (IDNs)", RFC 4690, September 2006.

471	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
472	              Applications (IDNA): Definitions and Document Framework",
473	              RFC 5890, August 2010.

475	   [RFC5891]  Klensin, J., "Internationalized Domain Names in
476	              Applications (IDNA): Protocol", RFC 5891, August 2010.

478	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
479	              Internationalized Domain Names for Applications (IDNA)",
480	              RFC 5892, August 2010.

482	   [RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
483	              Internationalized Domain Names for Applications (IDNA)",
484	              RFC 5893, August 2010.

486	   [RFC5894]  Klensin, J., "Internationalized Domain Names for
487	              Applications (IDNA): Background, Explanation, and
488	              Rationale", RFC 5894, August 2010.

490	   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
491	              Internationalized Domain Names in Applications (IDNA)
492	              2008", RFC 5895, September 2010.

494	   [RFC6365]  Hoffman, P. and J. Klensin, "Terminology Used in
495	              Internationalization in the IETF", BCP 166, RFC 6365,
496	              September 2011.

498	   [UTR36]    Davis, M. and M. Suignard, "Unicode Security
499	              Considerations", Unicode Technical Report #36, July 2012.

501	   [WCAG20]   "Web Content Accessibility Guidelines (WCAG) 2.0",
502	              December 2008.

504	Authors' Addresses

506	   Andrew Sullivan
507	   Dyn, Inc.
508	   150 Dow St
509	   Manchester, NH  03101
510	   U.S.A.

512	   Email: asullivan@dyn.com

514	   Dave Thaler
515	   Microsoft
516	   One Microsoft Way
517	   Redmond, WA  98052
518	   U.S.A.

520	   Email: dthaler@microsoft.com

522	   John C Klensin
523	   1770 Massachusetts Ave, Ste 322
524	   Cambridge, MA  02140
525	   USA

527	   Phone: +1 617 491 5735
528	   Email: john-ietf@jck.com

530	   Olaf Kolkman
531	   NLnet Labs
532	   Science Park 400
533	   Amsterdam  1098 XH
534	   The Netherlands

536	   Email: olaf@NLnetLabs.nl