idnits 2.17.1 

draft-klensin-idna-5892upd-unicode70-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 184: '...ated to True for the label, it MUST be...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5982, updated by this document, for
     RFC5378 checks: 2008-05-13)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 21, 2014) is 3566 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Duplicate reference: RFC5892, mentioned in 'RFC5892', was also mentioned
     in 'RFC5892Erratum'.

  ** Downref: Normative reference to an Informational RFC: RFC 6943

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UAX15-Versioning'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode62-Arabic'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62-Hamza'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'


     Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                       J.C. Klensin
3	Internet-Draft                                              P. Faltstrom
4	Updates: 5982 (if approved)                                       Netnod
5	Intended status: Standards Track                           July 21, 2014
6	Expires: January 20, 2015

8	                     IDNA Update for Unicode 7.0.0
9	              draft-klensin-idna-5892upd-unicode70-00.txt

11	Abstract

13	   The current version of the IDNA specifications anticipated that each
14	   new version of Unicode would be reviewed to verify that no changes
15	   had been introduced that required adjustments to the set of rules
16	   and, in particular, whether new exceptions or backward compatibility
17	   adjustments were needed.  That review was conducted for Unicode 7.0.0
18	   and identified a problematic new code point.  This specification
19	   updates RFC 5892 to disallow that code point and provides information
20	   about the reasons why that exclusion is appropriate.  It also applies
21	   an editorial clarification that was the subject of an earlier
22	   erratum.

24	Status of this Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at http://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on January 20, 2015.

41	Copyright Notice

43	   Copyright (c) 2014 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents (http://trustee.ietf.org/
48	   license-info) in effect on the date of publication of this document.
49	   Please review these documents carefully, as they describe your rights
50	   and restrictions with respect to this document.  Code Components
51	   extracted from this document must include Simplified BSD License text
52	   as described in Section 4.e of the Trust Legal Provisions and are
53	   provided without warranty as described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction
58	   2.  Change to RFC 5892 for new character U+08A1
59	   3.  Editorial clarification to RFC 5892
60	   4.  Explanation
61	     4.1.  A related historical problem
62	     4.2.  How this is being done
63	       4.2.1.  Backward compatibility and normalization
64	       4.2.2.  A new contextual rule
65	   5.  Acknowledgements
66	   6.  IANA Considerations
67	   7.  Security Considerations
68	   8.  References
69	     8.1.  Normative References
70	     8.2.  Informative References
71	   Authors' Addresses

73	1.  Introduction

75	   The current version of the IDNA specifications, known as "IDNA2008"
76	   [RFC5890], anticipated that each new version of Unicode would be
77	   reviewed to verify that no changes had been introduced that required
78	   adjustments to IDNA's rules and, in particular, whether new
79	   exceptions or backward compatibility adjustments were needed.  When
80	   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
81	   comparing it to prior versions including the text in Unicode 6.2
82	   [Unicode62], it identified a problematic new code point (U+08A1,
83	   ARABIC LETTER BEH WITH HAMZA ABOVE).  Section 2 of this specification
84	   updates the portion of the IDNA2008 specification that identifies
85	   rules for what characters are permitted [RFC5892] to disallow that
86	   code point.  It also provides information about the reasons why that
87	   exclusion is appropriate.

89	   As anticipated when IDNA2008, and RFC 5892 in particular, were
90	   written, exceptions and explicit updates are likely to be needed only
91	   if there is disagreement between the Unicode Consortium's view about
92	   what is best for the Standard and the IETF's view of what is best for
93	   IDNs, the DNS, and IDNA.  It was hoped that a situation would never
94	   arise in which the two perspectives would disagree, but the
95	   possibility was anticipated and considerable mechanism added to RFCs
96	   5890 and 5892 as a result.  It is probably important to note that a
97	   disagreement in this context does not imply that anyone is "wrong",
98	   only that the two different groups have different needs and therefore
99	   criteria about what is acceptable.  For that reason, the IETF has, in
100	   the past, allowed some characters for IDNA that active Unicode
101	   Technical Committee members suggested be disallowed to avoid a change
102	   in derived tables [RFC6452].  This document describes a case where
103	   the IETF should disallow a character that the various properties
104	   would otherwise treat as PVALID.

106	   This document provides the "flagging for the IESG" specified by
107	   Section 5.1 of RFC 5892.  As specified there, the change itself
108	   requires IETF review because it alters the rules of Section 2 of that
109	   document.

111	   Readers of this document are expected to be familiar with Unicode
112	   terminology [Unicode62] and the IETF conventions for representing
113	   Unicode code points [RFC5137].

115	   As a convenience to readers of RFC 5892 and to reduce the risks of
116	   confusion, this document also formally applies the content of an
117	   erratum to the text of the RFC (see Section 3) and so brings that RFC
118	   up to date with all agreed changes.

120	      [[RFC Editor: please remove the following comment and note if they
121	      get to you.]]

123	      [[IESG: It might not be a bad idea to incorporate some version of
124	      the following into the Last Call announcement.]]

126	      NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
127	      particularly the extended discussion below of why this change to
128	      RFC 5892 is necessary and appropriate, are fairly esoteric.
129	      Understanding them requires that one have at least some
130	      understanding of how the Arabic Script works and the reasons the
131	      Unicode Standard gives various Arabic Script characters a fairly
132	      extended discussion.  It also requires understanding of a number
133	      of Unicode principles, including the Normalization Stability rules
134	      as applied to new precomposed characters and guidelines for adding
135	      new characters.  References are provided for those who want to
136	      pursue them, but potential reviewers should assume that the
137	      background needed to understand the reasons for this change is no
138	      less deep in the subject matter than would be expected of someone
139	      reviewing a proposed change in, e.g., the fundamentals of BGP, TCP
140	      congestion control, or some cryptographic algorithm.

142	2.  Change to RFC 5892 for new character U+08A1

144	   With the publication of this document, Section 2.6 ("Exceptions (F)")
145	   of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
146	   Category F so that the rule itself reads:

148	   F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
149	                0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
150	                0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
151	                06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
152	                3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
153	                303B, 30FB}

155	   and then add to the subtable designated
156	   "DISALLOWED -- Would otherwise have been PVALID"
157	   after the line that begins "07FA", the additional line:

159	      08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE

161	   This has the effect of making the cited code point DISALLOWED
162	   independent of application of the rest of the IDNA rule set to the
163	   current version of Unicode.  Those wishing to create domain name
164	   labels containing Beh with Hamza Above may continue to use the
165	   sequence

167	      U+0628, ARABIC LETTER BEH
168	   followed by

170	      U+0654, ARABIC HAMZA ABOVE

172	   which was valid for IDNA purposes in Unicode 5.0 and earlier and
173	   which continues to be valid.
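
   The effect of the Category F addition can be illustrated with a
   minimal Python sketch.  It is not the IDNA2008 derived-property
   algorithm: the set below is a deliberately partial, hypothetical
   stand-in that holds only the two DISALLOWED exception code points
   named in this section (07FA and 08A1), and the function name is
   invented for illustration.

```python
# Partial, illustrative stand-in for the "DISALLOWED -- Would otherwise
# have been PVALID" subtable; only the two entries named in this section.
DISALLOWED_EXCEPTIONS_PARTIAL = {0x07FA, 0x08A1}


def first_disallowed_exception(label):
    """Return the index of the first code point in `label` that appears in
    the partial exception set, or None if there is no such code point."""
    for index, ch in enumerate(label):
        if ord(ch) in DISALLOWED_EXCEPTIONS_PARTIAL:
            return index
    return None


# The combining sequence BEH + HAMZA ABOVE is not caught by the exception.
print(first_disallowed_exception("\u0628\u0654"))
# The precomposed U+08A1 is caught wherever it occurs in the label.
print(first_disallowed_exception("a\u08a1"))
```

   A real implementation would apply the complete rule set of RFC 5892
   Section 2 in the order given there; this sketch shows only that the
   exception is keyed on the single code point, not on its appearance.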

175	3.  Editorial clarification to RFC 5892

177	   Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
178	   clarification to Appendix A and Section A.1 of RFC 5892.  This
179	   section of this document updates the RFC to apply that clarification.

181	   1.  In Appendix A, add a new paragraph after the paragraph that
182	       begins "The code point...".  The new paragraph should read:

184	   "For the rule to be evaluated to True for the label, it MUST be
185	   evaluated separately for every occurrence of the Code point in the
186	   label; each of those evaluations must result in True."

188	   2.  In Appendix A, Section A.1, replace the "Rule Set" by

190	     Rule Set:
191	       False;
192	       If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
193	       If cp .eq. \u200C And
194	              RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
195	         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
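
   One possible reading of the clarified rule can be sketched in Python.
   The sketch assumes that the regular expression is anchored on the
   occurrence of cp being tested, and it uses a small hand-written
   Joining_Type sample (SAMPLE_JOINING_TYPE, hypothetical) because the
   Python standard library does not expose that property; a real
   implementation would load the Unicode joining data instead.  The
   function names are illustrative and not taken from RFC 5892.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

# Hypothetical hand-written sample of Joining_Type values, for the demo only.
SAMPLE_JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u0654": "T",  # ARABIC HAMZA ABOVE (transparent)
}


def zwnj_rule_at(label, i, joining_type=SAMPLE_JOINING_TYPE):
    """Evaluate the A.1 rule set for the single ZWNJ found at index i."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # Left context: skip Transparent characters, then require L or D.
    j = i - 1
    while j >= 0 and joining_type.get(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type.get(label[j]) not in ("L", "D"):
        return False
    # Right context: skip Transparent characters, then require R or D.
    k = i + 1
    while k < len(label) and joining_type.get(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type.get(label[k]) in ("R", "D")


def zwnj_rule_for_label(label, joining_type=SAMPLE_JOINING_TYPE):
    """True only if every occurrence of ZWNJ in the label passes the rule."""
    return all(zwnj_rule_at(label, i, joining_type)
               for i, ch in enumerate(label) if ch == ZWNJ)
```

   With this reading, a label such as BEH, ZWNJ, ALEF, ZWNJ fails as a
   whole even though its first ZWNJ passes, which is the behavior the
   new Appendix A paragraph makes explicit.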

197	4.  Explanation

199	   [[NOTE IN DRAFT: Given the nature of this document, we believe this
200	   material belongs here.  It could, however, be moved to an appendix if
201	   anyone felt strongly about that.]]

203	   This section summarizes some of the discussions and reasoning that
204	   led to the conclusion and change in Section 2.  It should not be
205	   considered as either normative or authoritative.

207	   As the Unicode Standard points out at some length [Unicode62-Arabic],
208	   Hamza is a problematic abstract character and the "Hamza Above"
209	   construction even more so [Unicode62-Hamza].  Those sections explain
210	   a distinction made by Unicode between the use of a Hamza mark to
211	   denote a glottal stop and one used as a diacritic mark to denote a
212	   separate letter.  In the first case, the combining sequence is used.
213	   In the second, a precombined character is assigned.

215	   Unlike Unicode generally and because of concerns about identifier
216	   spoofing and attacks based on similarities, character distinctions in
217	   IDNA are based much more strictly on the appearance of characters;
218	   pronunciation distinctions are not considered.  So, for IDNA, BEH
219	   WITH HAMZA ABOVE is not-quite-tautologically the same as BEH WITH
220	   HAMZA ABOVE, even if one of them is written as U+08A1 (new to Unicode
221	   7.0.0) and the other as the sequence \u'0628'\u'0654' (feasible with
222	   Unicode 7.0.0 but also available in versions of Unicode going back at
223	   least to the original publication of RFC 5892).   Because the two
224	   are, for IDNA purposes, the same, IDNA expects that normalization
225	   (specifically the requirement that all U-labels be in NFC form) will
226	   cause them to compare equal.

228	   If Unicode also considered them the same, then the principle would
229	   apply that new precomposed ("composition") forms are not added unless
230	   one of the code points that could be used to construct them did not
231	   exist in an earlier version (and even then adding them is
232	   discouraged) [UAX15-Versioning].  When exceptions are made, they are
233	   expected to conform to the rules and classes in the "Composition
234	   Exclusion Table", with class 2 being relevant to this case
235	   [UAX15-Exclusion].  That rule essentially requires that the old
236	   combining sequence continue to normalize to itself
237	   (for stability) and that the newly-added character be treated as
238	   canonically decomposable, decomposing back to the older sequence
239	   even under NFC.  That was not done for this particular case,
240	   presumably because of the distinction between pronunciation modifiers
241	   and separate letters noted above.  Because, for IDNA and the DNS,
242	   there is a possibility that the composing sequence \u'0628'\u'0654'
243	   already appears in labels, the only choice other than allowing an
244	   otherwise-identical, and identically-appearing, label with U+08A1
245	   substituted to identify a different DNS entry is to DISALLOW the new
246	   character.
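
   The normalization behavior described above can be observed with
   Python's standard unicodedata module.  This is a small sketch, not
   part of the specification; it assumes the interpreter's bundled
   Unicode data is version 7.0.0 or later (see
   unicodedata.unidata_version), and the helper name nfc_equal is
   invented for illustration.

```python
import unicodedata

PRECOMPOSED = "\u08a1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+08A1 has no canonical decomposition, so NFC leaves both forms unchanged
# and the two visually identical strings remain distinct.
print(repr(unicodedata.decomposition(PRECOMPOSED)))
print(nfc_equal(PRECOMPOSED, SEQUENCE))
```

   Because the comparison fails, two labels that differ only in this
   choice of encoding would be distinct DNS names unless one of the
   forms is disallowed.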

248	4.1.  A related historical problem

250	   At least three other grapheme clusters have been present for many
251	   versions of Unicode and can be seen as involving issues similar to
252	   those for the newly-added ARABIC LETTER BEH WITH HAMZA ABOVE.  ARABIC
253	   LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER REH WITH HAMZA
254	   ABOVE (U+076C) do not have decomposition forms and are preferred over
255	   combining sequences using HAMZA ABOVE (U+0654) [Unicode62-Hamza].  By
256	   contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE (U+0623) decomposes
257	   into \u'0627'\u'0654' and ARABIC LETTER YEH WITH HAMZA ABOVE (U+0626)
258	   decomposes into \u'064A'\u'0654' so the precomposed character and
259	   combining sequences compare equal when both are normalized, as this
260	   specification prefers.

262	   There are other variations on this theme.  For example, ARABIC LETTER
263	   U WITH HAMZA ABOVE (U+0677) has a compatibility decomposition into
264	   the combining sequence \u'06C7'\u'0674'.
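
   The decomposition mappings of the characters discussed in this
   section can be inspected directly.  The following Python sketch uses
   only the standard unicodedata module and prints whatever mapping the
   interpreter's bundled Unicode data reports for each code point; the
   helper name decomposition_of is invented for illustration.

```python
import unicodedata

# Code points discussed in this section.
CODE_POINTS = (0x0681, 0x076C, 0x0623, 0x0626, 0x0677)


def decomposition_of(cp):
    """Return the decomposition mapping for code point `cp` as reported by
    unicodedata; an empty string means no decomposition is defined."""
    return unicodedata.decomposition(chr(cp))


for cp in CODE_POINTS:
    mapping = decomposition_of(cp) or "(none)"
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {mapping}")
```

   A mapping that begins with a tag such as "<compat>" is a
   compatibility decomposition and is therefore not applied by NFC; an
   untagged mapping is canonical and is what makes a precomposed
   character and its combining sequence compare equal after
   normalization.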

266	   Had the issues outlined in this document been better understood at
267	   the time, it probably would have been wise for RFC 5892 to disallow
268	   either the precomposed character or the combining sequence of each
269	   pair unless Unicode normalization rules cause the right thing to
270	   happen.  Failure to do so at the time places an extra burden on
271	   registries to be sure that conflicts (and the potential for confusion
272	   and attacks) do not exist.   Oddly, had the exclusion been made part
273	   of the specification at that time, the preference noted above would
274	   probably have dictated excluding the combining sequence, something
275	   not otherwise done in IDNA2008.  Today, the only thing that can be
276	   excluded without the potential disruption of disallowing a
277	   previously-PVALID combining sequence is the newly-added code point, so
278	   whatever is done, or might have been contemplated with hindsight,
279	   would be somewhat inconsistent.

281	4.2.  How this is being done

283	   Questions have arisen as to why this specification makes the change
284	   to RFC 5892 by DISALLOWing U+08A1 as a simple exception (IDNA
285	   Category F, RFC 5892 Section 2.6) rather than either a backward-
286	   compatibility case (IDNA Category G, RFC 5892 Section 2.7) or
287	   modifying IDNA Category F to make Hamza (or Hamza Above, or combining
288	   Hamza generally) into CONTEXTO cases and specifying appropriate
289	   limitations in a new entry in the IANA IDNA Context Registry (as
290	   specified in RFC 5892 Section 5.2).  The subsections below explain
291	   why neither of those alternatives was chosen despite some discussion
292	   of each.

294	4.2.1.  Backward compatibility and normalization

296	   The "BackwardCompatible" category (IDNA Category G, RFC 5892 Section
297	   2.7) is described as applying only when "property values in versions
298	   of Unicode after 5.2 have changed in such a way that the derived
299	   property value would no longer be PVALID or DISALLOWED".  Because
300	   U+08A1 is a newly-added code point in Unicode 7.0.0 and no property
301	   values of code points in prior versions have changed, category G
302	   does not apply.  If that section of RFC 5892 is replaced in the
303	   future, perhaps consideration should be given to adding Normalization
304	   Stability and other issues to that description but, at present, it is
305	   not relevant.

307	4.2.2.  A new contextual rule

309	   As the Unicode Standard points out at some length [Unicode62-Arabic],
310	   Hamza is a problematic abstract character and the "Hamza Above"
311	   construction even more so.  IDNA has historically associated
312	   characters whose use is reasonable in some contexts but not others
313	   with the special derived property "CONTEXTO" and then specified
314	   specific, context-dependent, rules about where they may be used.
315	   Because Hamza Above is problematic (and spawns edge cases, as
316	   discussed in the Unicode Standard section cited above), it was
317	   suggested that a contextual rule might be appropriate.   There are at
318	   least two reasons why a contextual rule would not be suitable for the
319	   present situation.

321	   1.  As discussed above, the present situation is a normalization
322	       stability and predictability problem, not a contextual one.  Had
323	       the same issues arisen with a newly-added precomposed character
324	       that could previously be constructed from non-problematic base
325	       and combining characters, it would be even more clearly a
326	       normalization issue and, following the principles discussed there
327	       and particularly in UAX 15 [UAX15-Exclusion], might not have been
328	       assigned at all.

330	   2.  The contextual rule sets are designed around restricting the use
331	       of code points to a particular script or adjacent to particular
332	       characters within that script.  Neither of these cases applies to
333	       the newly-added character even if one could imagine rules for the
334	       use of Hamza Above (U+0654) that would reflect the considerations
335	       of Chapter 8 of Unicode 6.2.  Even had the latter been desired,
336	       it would be somewhat late now -- Hamza Above has been present as
337	       a combining character (U+0654) in many versions of Unicode.
338	       While that section of the Unicode Standard describes the issues,
339	       it does not provide actionable guidance about what to do about it
340	       for cases going forward or when visual identity is important.

342	5.  Acknowledgements

344	   The Unicode 7.0.0 changes were extensively discussed within the IAB's
345	   Internationalization Program.  The authors are grateful for the
346	   discussions and feedback there, especially from Andrew Sullivan and
347	   David Thaler.  Additional information was requested and received from
348	   Mark Davis and Ken Whistler and, while they probably do not agree with
349	   the necessity of excluding this code point, as their responsibility is
350	   to look at the Unicode Consortium requirements for stability, the
351	   decision would not have been possible without their input.  Several
352	   experts and reviewers who prefer to remain anonymous also provided
353	   helpful input and comments on preliminary versions of this document.

355	6.  IANA Considerations

357	   When the IANA registry and tables are updated to reflect Unicode
358	   7.0.0, code point U+08A1 should be identified as DISALLOWED,
359	   consistent with the change made in Section 2.

361	7.  Security Considerations

363	   This specification excludes a code point for which the Unicode-
364	   specified normalization behavior could result in two ways to form a
365	   visually-identical character within the same script not comparing
366	   equal.   That behavior could create a dream case for someone
367	   intending to confuse the user by use of a domain name that looked
368	   identical to another one, was entirely in the same script, but was
369	   still considered different (see, for example, the discussion of false
370	   negatives in identifier comparison in Section 2.1 of RFC 6943
371	   [RFC6943]).  This exclusion therefore should improve Internet
372	   security.

374	8.  References

376	8.1.  Normative References

378	   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
379	              137, RFC 5137, February 2008.

381	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
382	              Applications (IDNA): Definitions and Document Framework",
383	              RFC 5890, August 2010.

385	   [RFC5892Erratum]
386	              "RFC5892, "The Unicode Code Points and Internationalized
387	              Domain Names for Applications (IDNA)", August 2010, Errata
388	              ID: 3312", Errata ID 3312, August 2012, <http://www.rfc-
389	              editor.org/errata_search.php?rfc=5892>.

391	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
392	              Internationalized Domain Names for Applications (IDNA)",
393	              RFC 5892, August 2010.

395	   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
396	              Purposes", RFC 6943, May 2013.

398	   [UAX15-Exclusion]
399	              Davis, M., Ed., "Unicode Standard Annex #15: Unicode
400	              Normalization Forms, Section 5", June 2014, <http://
401	              www.unicode.org/reports/tr15/
402	              #Primary_Exclusion_List_Table>.

404	   [UAX15-Versioning]
405	              Davis, M., Ed., "Unicode Standard Annex #15: Unicode
406	              Normalization Forms, Section 3", June 2014, <http://
407	              www.unicode.org/reports/tr15/#Versioning>.

409	   [Unicode62-Arabic]
410	              "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
411	              Chapter 8, 2012, <http://www.unicode.org/versions/
412	              Unicode6.2.0/ch08.pdf>.

414	              Subsection titled "Encoding Principles", paragraph
415	              numbered 4, starting on page 251.

417	   [Unicode62-Hamza]
418	              "The Unicode Standard, Version 6.2.0, op. cit., Chapter 8",
419	              Chapter 8, 2012, <http://www.unicode.org/versions/
420	              Unicode6.2.0/ch08.pdf>.

422	              Subsection titled "Combining Hamza Above" starting on page
423	              263.

425	   [Unicode62]
426	              The Unicode Consortium, "The Unicode Standard, Version
427	              6.2.0", ISBN 978-1-936213-07-8, 2012, <http://
428	              www.unicode.org/versions/Unicode6.2.0/>.

430	              Preferred citation: The Unicode Consortium.  The Unicode
431	              Standard, Version 6.2.0, (Mountain View, CA: The Unicode
432	              Consortium, 2012. ISBN 978-1-936213-07-8)

434	   [Unicode7]
435	              The Unicode Consortium, "The Unicode Standard, Version
436	              7.0.0", ISBN 978-1-936213-09-2, 2014, <http://
437	              www.unicode.org/versions/Unicode7.0.0/>.

439	              Preferred Citation: The Unicode Consortium.  The Unicode
440	              Standard, Version 7.0.0, (Mountain View, CA: The Unicode
441	              Consortium, 2014.  ISBN 978-1-936213-09-2)

443	8.2.  Informative References

445	   [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
446	              Internationalized Domain Names for Applications (IDNA) -
447	              Unicode 6.0", RFC 6452, November 2011.

449	Authors' Addresses

451	   John C Klensin
452	   1770 Massachusetts Ave, Ste 322
453	   Cambridge, MA 02140
454	   USA

456	   Phone: +1 617 245 1457
457	   Email: john-ietf@jck.com

459	   Patrik Faltstrom
460	   Netnod
461	   Franzengatan 5
462	   Stockholm, 112 51
463	   Sweden

465	   Phone: +46 70 6059051
466	   Email: paf@netnod.se