idnits 2.17.1 

draft-klensin-idna-5892upd-unicode70-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == It seems as if not all pages are separated by form feeds - found 28 form
     feeds but 744 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 1033: '...ated to True for the label, it MUST be...'

  -- The draft header indicates that this document updates RFC5892, but the
     abstract doesn't seem to directly say this.  It does mention RFC5892
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5892, updated by this document, for
     RFC5378 checks: 2008-04-26)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 10, 2015) is 3335 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'PRECIS-Framework'

  -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also
     mentioned in 'RFC5892'.

  ** Downref: Normative reference to an Informational RFC: RFC 5894

  ** Downref: Normative reference to an Informational RFC: RFC 6943

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UAX15-Versioning'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS46'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicod70-CompatDecomp'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicod70-Overlay'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Arabic'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Design'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode70-Hamza'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Stability'

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 19 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft
4	Updates: 5892, 5894 (if approved)                           P. Faltstrom
5	Intended status: Standards Track                                  Netnod
6	Expires: September 11, 2015                               March 10, 2015

8	                     IDNA Update for Unicode 7.0.0
9	              draft-klensin-idna-5892upd-unicode70-04.txt

11	Abstract

13	   The current version of the IDNA specifications anticipated that each
14	   new version of Unicode would be reviewed to verify that no changes
15	   had been introduced that required adjustments to the set of rules
16	   and, in particular, whether new exceptions or backward compatibility
17	   adjustments were needed.  The review for Unicode 7.0.0 first
18	   identified a potentially problematic new code point and then a much
19	   more general and difficult issue with Unicode normalization.  This
20	   specification discusses those issues and proposes updates to IDNA
21	   and, potentially, the way the IETF handles comparison of identifiers
22	   more generally, especially when there is no associated language or
23	   language identification.  It also applies an editorial clarification
24	   to RFC 5892 that was the subject of an earlier erratum and updates
25	   RFC 5894 to point to the issues involved.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on September 11, 2015.

44	Copyright Notice

46	   Copyright (c) 2015 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
62	   2.  Document Aspirations  . . . . . . . . . . . . . . . . . . . .   6
63	   3.  Problem Description . . . . . . . . . . . . . . . . . . . . .   7
64	     3.1.  IDNA assumptions about Unicode normalization  . . . . . .   7
65	     3.2.  The discovery and the Arabic script cases . . . . . . . .   9
66	       3.2.1.  New code point U+08A1, decomposition, and language
67	               dependency  . . . . . . . . . . . . . . . . . . . . .   9
68	       3.2.2.  Other examples of the same behavior within the Arabic
69	               Script  . . . . . . . . . . . . . . . . . . . . . . .  10
70	       3.2.3.  Hamza and Combining Sequences . . . . . . . . . . . .  10
71	     3.3.  Precomposed characters without decompositions more
72	           generally . . . . . . . . . . . . . . . . . . . . . . . .  11
73	       3.3.1.  Description of the general problem  . . . . . . . . .  11
74	       3.3.2.  Latin Examples and Cases  . . . . . . . . . . . . . .  12
75	       3.3.3.  Examples and Cases from Other Scripts . . . . . . . .  14
76	       3.3.4.  Scripts with precomposed preferences and ones with
77	               combining preferences . . . . . . . . . . . . . . . .  15
78	     3.4.  Confusion and the casual user . . . . . . . . . . . . . .  15
79	   4.  Implementation options and issues: Unicode properties,
80	       exceptions, and the nature of stability . . . . . . . . . . .  15
81	     4.1.  Unicode Stability compared to IETF (and ICANN) Stability   15
82	     4.2.  New Unicode Properties  . . . . . . . . . . . . . . . . .  17
83	     4.3.  The need for exception lists  . . . . . . . . . . . . . .  18
84	   5.  Proposed/ Alternative Changes to RFC 5892 for the issues
85	       first exposed by new code point U+08A1  . . . . . . . . . . .  18
86	     5.1.  Disallow This New Code Point  . . . . . . . . . . . . . .  18
87	     5.2.  Disallow This New Code Point and All Future Precomposed
88	           Additions that do not decompose . . . . . . . . . . . . .  19
89	     5.3.  Disallow the combining sequences for these characters . .  19
90	     5.4.  Disallow all Combining Characters for Specific Scripts  .  21
91	     5.5.  Do Nothing Other Than Warn  . . . . . . . . . . . . . . .  21
92	     5.6.  Normalization Form IETF (NFI) . . . . . . . . . . . . . .  21
93	   6.  Editorial clarification to RFC 5892 . . . . . . . . . . . . .  22
94	   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  23
95	   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  23
96	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  23
97	   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  24
98	     10.1.  Normative References . . . . . . . . . . . . . . . . . .  24
99	     10.2.  Informative References . . . . . . . . . . . . . . . . .  27
100	   Appendix A.  Change Log . . . . . . . . . . . . . . . . . . . . .  28
101	     A.1.  Changes from version -00 to -01 . . . . . . . . . . . . .  28
102	     A.2.  Changes from version -01 to -02 . . . . . . . . . . . . .  28
103	     A.3.  Changes from version -02 to -03 . . . . . . . . . . . . .  29
104	     A.4.  Changes from version -03 to -04 . . . . . . . . . . . . .  29
105	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  29

107	1.  Introduction

109	      Note in/about -04 Draft: This version of the document contains a
110	      very large amount of new material as compared to the -03 version.
111	      The new material reflects an evolution of community understanding
112	      in the last two months from an assumption that the problem
113	      involved only a few code points and one combining character in a
114	      single script (Hamza Above and Arabic) to an understanding that it
115	      is quite pervasive and may represent fundamental misunderstandings
116	      or omissions from IDNA2008 (and, by extension, the basics of
117	      PRECIS [PRECIS-Framework]) that must be corrected if those
118	      protocols are going to be used in a way that supports Internet
119	      internationalized identifiers predictability (as seen by the end
120	      user) and security.

122	      This version is still necessarily incomplete: not only is our
123	      understanding probably still not comprehensive, but there are a
124	      number of placeholders for text and references.  Nonetheless, the
125	      document in its current form should be useful as both the
126	      beginning of a comprehensive overview of the issues and a source
127	      of references to other relevant materials.

129	      This draft could almost certainly be organized better to improve
130	      its readability: specific suggestions would be welcome.

132	   The current version of the IDNA specifications, known as "IDNA2008"
133	   [RFC5890], anticipated that each new version of Unicode would be
134	   reviewed to verify that no changes had been introduced that required
135	   adjustments to IDNA's rules and, in particular, whether new
136	   exceptions or backward compatibility adjustments were needed.  When
137	   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
138	   comparing it to prior versions including the text in Unicode 6.2
139	   [Unicode62], it identified a problematic new code point (U+08A1,
140	   ARABIC LETTER BEH WITH HAMZA ABOVE).  The code point was added for
141	   use with the Fula (also known as Fulfulde, Pulaar, and Pular'Fulaare)
142	   language, a language that, apparently, is most often written in Latin
143	   characters today [Omniglot-Fula] [Dalby] [Daniels].

145	   The specific problem is discussed in detail in Section 3.  In very
146	   broad terms, IDNA (and other IETF work) assume that, if one can
147	   represent "the same character" either as a combining sequence or as a
148	   single code point, strings that are identical except for those
149	   alternate forms will compare equal after normalization.  Part of the
150	   difficulty that has characterized this discussion is that "the same"
151	   differs depending on the criteria that are chosen.

153	   The behavior of the newly-added code point, while non-optimal for
154	   IDNA, follows that of a few code points that predate Unicode 7.x and
155	   even the IDNA2008 specifications and Unicode 6.0.  Those existing
156	   code points, which may not be easy to accurately characterize as a
157	   group, make the question of what, if anything, to do about this new
158	   one and, perhaps separately, what to do about existing sets of code
159	   points with the same behavior, exceedingly problematic because
160	   different reasonable criteria yield different decisions,
161	   specifically:

163	   o  To disallow it (and future, but not existing, characters with
164	      similar characteristics) as an IDNA exception case creates
165	      inconsistencies with how those earlier code points were handled.

167	   o  To disallow it and the similar code points as well would
168	      necessitate invalidating some potential labels that would have
169	      been valid under IDNA2008 until this time.  Depending on how the
170	      collection of similar code points is characterized, a few of them
171	      are almost certainly used in reasonable labels.

173	   o  To permit the new code point to be treated as PVALID creates a
174	      situation in which it is possible, within the same script, to
175	      compose the same character symbol (glyph) in two different ways
176	      that do not compare equal even after normalization.  That
177	      condition would then apply to it and the earlier code points with
178	      the same behavior.  That situation contradicts a fundamental
179	      assumption of IDNA that is discussed in more detail below.

181	   NOTE IN DRAFT:

183	      This working draft discusses six alternatives, including an idea
184	      (an IETF-specific normalization form) that seemed too drastic to
185	      be considered a few months ago.  However, it not only would have
186	      been appropriate to discuss when the IDNA2008 specifications were
187	      being developed but is appearing more attractive now.  The authors
188	      suggest that the community discuss the relevant tradeoffs and make
189	      a decision and that the document then be revised to reflect that
190	      decision, with the other alternatives discussed as options not
191	      chosen.  Because there is no ideal choice, the discussion of the
192	      issues in Section 3 is probably at least as important as the
193	      particular choice of how to handle this code point.  In addition
194	      to providing information for this document, that section should be
195	      considered as an updating addendum to RFC 5894 [RFC5894] and
196	      should be incorporated into any future revision of that document.

198	      As the result of this version of the document containing several
199	      alternate proposals, some of the text is also a little bit
200	      redundant.  That will be corrected in future versions.

202	   As anticipated when IDNA2008, and RFC 5892 in particular, were
203	   written, exceptions and explicit updates are likely to be needed only
204	   if there is disagreement between the Unicode Consortium's view about
205	   what is best for the Standard and the IETF's view of what is best for
206	   IDNs, the DNS, and IDNA.  It was hoped that a situation would never
207	   arise in which the two perspectives would disagree, but the
208	   possibility was anticipated and considerable mechanism added to RFC
209	   5890 and 5892 as a result.  It is probably important to note that a
210	   disagreement in this context does not imply that anyone is "wrong",
211	   only that the two different groups have different needs and therefore
212	   criteria about what is acceptable.  For that reason, the IETF has, in
213	   the past, allowed some characters for IDNA that active Unicode
214	   Technical Committee members suggested be disallowed to avoid a change
215	   in derived tables [RFC6452].  This document describes a case where
216	   the IETF should disallow a character or characters that the various
217	   properties would otherwise treat as PVALID.

219	   This document provides the "flagging for the IESG" specified by
220	   Section 5.1 of RFC 5892.  As specified there, the change itself
221	   requires IETF review because it alters the rules of Section 2 of that
222	   document.

224	      [[RFC Editor: please remove the following comment and note if they
225	      get to you.]]

227	      [[IESG: It might not be a bad idea to incorporate some version of
228	      the following into the Last Call announcement.]]

230	      NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
231	      particularly the choices among options for either adding exception
232	      cases to RFC 5892 or ignoring the issue, warning people, and
233	      hoping the results do not include serious problems, are fairly
234	      esoteric.  Understanding them requires that one have at least some
235	      understanding of how the Arabic Script (and perhaps other scripts
236	      in which precomposed characters are preferred over combining
237	      sequences as a Unicode design and extension principle) works and
238	      the reasons the Unicode Standard gives various Arabic Script
239	      characters a fairly extended discussion [Unicode70-Arabic].  It
240	      also requires understanding of a number of Unicode principles,
241	      including the Normalization Stability rules [UAX15-Versioning] as
242	      applied to new precomposed characters and guidelines for adding
243	      new characters.  There is considerable discussion of the issues in
244	      Section 3 and references are provided for those who want to pursue
245	      them, but potential reviewers should assume that the background
246	      needed to understand the reasons for this change is no less deep
247	      in the subject matter than would be expected of someone reviewing
248	      a proposed change in, e.g., the fundamentals of BGP, TCP
249	      congestion control, or some cryptographic algorithm.  Put more
250	      bluntly, one's ability to read or speak languages other than
251	      English, or even one or more languages that use the Arabic script
252	      or other scripts similarly affected, does not make one an expert
253	      in these matters.

255	   This document assumes that the reader is reasonably familiar with the
256	   terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
257	   IETF conventions for representing Unicode code points [RFC5137].
258	   Some terms used here may not be used in the same way in those two
259	   sets of documents.  From one point of view, those differences may
260	   have been the results of, or led to, misunderstandings that may, in
261	   turn, be part of the root cause of the problems explored in this
262	   document.  In particular, this document uses the term "precomposed
263	   character" to describe characters that could reasonably be composed
264	   by a combining sequence using code points in the same script but for
265	   which a single code point that does not require combining sequences
266	   is available.  That definition is strictly about mechanical
267	   composition and does not involve any considerations about how the
268	   character is used.  It is closely related to this document's
269	   definition of "identical".  When a precomposed character exists and
270	   either applying NFC to the combining sequence does not yield that
271	   character or applying NFD to that character's code point does not
272	   yield the combining sequence, it is referred to in this document as
273	   "non-decomposable".
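
   The definition of "non-decomposable" given above can be read as a
   mechanical test.  The following is a minimal sketch of that test,
   assuming Python's standard "unicodedata" module (whose results depend
   on the Unicode version built into the interpreter); the function name
   is hypothetical and is not taken from any specification.

```python
import unicodedata


def is_non_decomposable(precomposed: str, sequence: str) -> bool:
    """Apply this document's definition of "non-decomposable".

    Hypothetical helper: returns True when NFC of the combining sequence
    does not yield the precomposed character, or NFD of the precomposed
    character does not yield the combining sequence.
    """
    nfc_of_sequence = unicodedata.normalize("NFC", sequence)
    nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)
    return nfc_of_sequence != precomposed or nfd_of_precomposed != sequence


# Latin o with diaeresis: the two forms are related by NFC and NFD.
print(is_non_decomposable("\u00F6", "o\u0308"))
# BEH WITH HAMZA ABOVE: the two forms are not related by normalization.
print(is_non_decomposable("\u08A1", "\u0628\u0654"))
```

   The test is deliberately limited to the mechanical relationship
   between the two forms; it says nothing about how either form is used
   in any language.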

275	2.  Document Aspirations

277	   This document, in its present form, is not a proposal for a solution.
278	   Instead, it is intended to be (or evolve into) a comprehensive
279	   description of the issues and problems and to outline some possible
280	   approaches to a solution.  A perfect solution -- one that would
281	   resolve all of the issues identified in this document -- would
282	   involve a relatively small set of relatively simple rules and hence
283	   would be comprehensible and predictable for and by non-expert end
284	   users, would not require code point by code point or even block by
285	   block exception lists, and would not leave users of any script or
286	   language feeling that their particular writing system has been
287	   treated less fairly than others.

289	   Part of the reality we need to accept is that IDNA, in its present
290	   form, represents compromises that do not completely satisfy those
291	   criteria and whatever is done about these issues will probably make
292	   it (or the job of administering zones containing IDNs) more complex.
293	   Similarly, as the Unicode Standard suggests when it identifies ten
294	   Design Principles and the text then says "Not all of these principles
295	   can be satisfied simultaneously..." [Unicode70-Design], while there
296	   are guidelines and principles, a certain amount of subjective
297	   judgment is involved in making determinations about normalization,
298	   decomposition, and some property values.  For Unicode itself, those
299	   issues are resolved by multiple statements (at least one cited below)
300	   that one needs to rely on per-code point information in the Unicode
301	   Character Database rather than on rules or principles.  The design of
302	   IDNA and the effort to keep it largely independent of Unicode
303	   versions requires rules, categories, and principles that can be
304	   relied upon and applied algorithmically.  There is obviously some
305	   tension between the two approaches.

307	3.  Problem Description

309	3.1.  IDNA assumptions about Unicode normalization

311	   IDNA makes several assumptions about Unicode, Unicode "characters",
312	   and the effects of normalization.  Those assumptions were based on
313	   careful reading of the Unicode Standard at the time [Unicode5],
314	   guided by advice and commitments by members of the Unicode Technical
315	   Committee.  Those assumptions, and the associated requirements, are
316	   necessitated by three properties of DNS labels that typically do not
317	   apply to blocks of running text:

319	   1.  There is no language context for a label.  While particular DNS
320	       zones may impose restrictions, including language or script
321	       restrictions, on what labels can be registered, neither the DNS
322	       nor IDNA impose either type of restriction or give the user of a
323	       label any indication about the registration or other restrictions
324	       that may have been imposed.

326	   2.  Labels are often mnemonics rather than words in any language.
327	       They may be abbreviations or acronyms or contain embedded digits
328	       and have other characteristics that are not typical of words.

330	   3.  Labels are, in practice, usually short.  Even when they are the
331	       maximum length allowed by the DNS and IDNA, they are typically
332	       too short to provide significant context.  Statements that
333	       suggest that languages can almost always be determined from
334	       relatively short paragraphs or equivalent bodies of text do not
335	       apply to DNS labels because of their typical short length and
336	       because, as noted above, they are not required to be formed
337	       according to language-based rules.

339	   At the same time, because the DNS is an exact-match system, there
340	   must be no ambiguity about whether two labels are equal.  Although
341	   there have been extensive discussions about "confusingly similar"
342	   characters, labels, and strings, such tests between scripts are
343	   always somewhat subjective: they are affected by choices of type
344	   styles and by what the user expects to see.  In spite of the fact
345	   that the glyphs that represent many characters in different scripts
346	   are identical in appearance (e.g., basic Latin "a" (U+0061) and the
347	   identical-appearing Cyrillic character (U+0430)), the most important
348	   test is that, if two glyphs are the same within a given script, they
349	   must represent the same character no matter how they are formed.
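
   As an illustration of the cross-script case mentioned above, the
   following sketch (assuming Python's standard "unicodedata" module;
   the helper name is hypothetical) shows that the Latin and Cyrillic
   letters remain distinct code points under every Unicode normalization
   form, so normalization cannot be expected to address that kind of
   similarity.

```python
import unicodedata

LATIN_A = "\u0061"     # LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A


def equal_after(form: str, left: str, right: str) -> bool:
    """Compare two strings after applying one normalization form."""
    normalized_left = unicodedata.normalize(form, left)
    normalized_right = unicodedata.normalize(form, right)
    return normalized_left == normalized_right


# Record, for each normalization form, whether the two letters match.
cross_script = {
    form: equal_after(form, LATIN_A, CYRILLIC_A)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(unicodedata.name(LATIN_A), "vs.", unicodedata.name(CYRILLIC_A))
print(cross_script)
```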

351	   Unicode normalization, as explained in [UAX15], is expected to
352	   resolve those "same script, same glyph, different formation methods"
353	   issues.  Within the Latin script, the code point sequence for lower
354	   case "o" (U+006F) and combining diaeresis (U+0308) will, when
355	   normalized using the "NFC" method required by IDNA, produce the
356	   precomposed small letter o with diaeresis (U+00F6) and hence the two
357	   ways of forming the character will compare equal (and the combining
358	   sequence is effectively prohibited from U-labels).
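
   The Latin example above can be checked directly.  This is a minimal
   sketch assuming Python's standard "unicodedata" module; "is_in_nfc"
   is a hypothetical helper that approximates only the NFC requirement
   for U-labels, not the full set of IDNA validity rules.

```python
import unicodedata

SEQUENCE = "o\u0308"    # "o" followed by COMBINING DIAERESIS
PRECOMPOSED = "\u00F6"  # LATIN SMALL LETTER O WITH DIAERESIS


def is_in_nfc(label: str) -> bool:
    """True if the string is already in NFC (unchanged by NFC)."""
    return unicodedata.normalize("NFC", label) == label


# NFC turns the combining sequence into the precomposed character,
# so the two ways of forming the character compare equal.
nfc_of_sequence = unicodedata.normalize("NFC", SEQUENCE)
print(nfc_of_sequence == PRECOMPOSED)

# The precomposed form is in NFC; the combining sequence is not, which
# is why the sequence cannot appear in a string required to be in NFC.
print(is_in_nfc(PRECOMPOSED), is_in_nfc(SEQUENCE))
```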

360	   NFC was preferred over other normalization methods for IDNA because
361	   it is more compact, more likely to be produced on keyboards on which
362	   the relevant characters actually appeared, and because it does not
363	   lose substantive information (e.g., some types of compatibility
364	   equivalence involve judgment calls as to whether two characters are
365	   actually the same -- they may be "the same" in some contexts but not
366	   others -- while canonical equivalence is about different ways to
367	   produce the glyph for the same abstract character).
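
   The difference between canonical and compatibility equivalence can
   be seen with a character that has only a compatibility mapping.  The
   sketch below assumes Python's standard "unicodedata" module and uses
   the Latin "fi" ligature purely as an illustration; it is not an
   example taken from the IDNA specifications.

```python
import unicodedata

LIGATURE_FI = "\uFB01"  # LATIN SMALL LIGATURE FI

# NFC applies canonical equivalence only: the ligature is preserved.
nfc = unicodedata.normalize("NFC", LIGATURE_FI)

# NFKC also applies compatibility mappings: the ligature is replaced by
# two separate letters, and the original distinction cannot be
# recovered from the result.
nfkc = unicodedata.normalize("NFKC", LIGATURE_FI)

print(nfc == LIGATURE_FI)
print(nfkc == "fi")
```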

369	   IDNA also assumed that the extensive Unicode stability rules would be
370	   applied and work as specified when new code points were added.  Those
371	   rules, as described in The Unicode Standard and the normative annexes
372	   identified below, provide that:

374	   1.  New code points representing precomposed characters that can be
375	       formed from combining sequences will not be added to Unicode
376	       unless neither the relevant base character nor required combining
377	       character(s) are part of the Standard within the relevant script
378	       [UAX15-Versioning].

380	   2.  If circumstances require that principle be violated,
381	       normalization stability requires that the newly-added character
382	       decompose (even under NFC) to the previously-available combining
383	       sequence [UAX15-Exclusion].
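
   The behavior described in the second rule can be observed for a code
   point that was added after its combining sequence already existed.
   The sketch below assumes Python's standard "unicodedata" module and
   uses U+2ADC (FORKING), which has a canonical decomposition and is
   excluded from composition, as an illustration; the choice of example
   is this sketch's, not the draft's.

```python
import unicodedata

FORKING = "\u2ADC"         # FORKING
SEQUENCE = "\u2ADD\u0338"  # NONFORKING + COMBINING LONG SOLIDUS OVERLAY

# The character carries a canonical decomposition mapping.
mapping = unicodedata.decomposition(FORKING)
print(mapping)

# Because it is excluded from composition, even NFC yields the older
# combining sequence rather than the single code point.
nfc_of_forking = unicodedata.normalize("NFC", FORKING)
nfc_of_sequence = unicodedata.normalize("NFC", SEQUENCE)
print(nfc_of_forking == SEQUENCE, nfc_of_sequence == SEQUENCE)
```

   Under that treatment the precomposed character and the combining
   sequence compare equal after normalization, which is the outcome
   IDNA assumed for newly-added precomposed characters.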

385	   At least at the time IDNA2008 was being developed, there was no
386	   explicit provision in the Standard's discussion of conditions for
387	   adding new code points, nor of normalization stability, for an
388	   exception based on different languages using the same script or
389	   ambiguities about the shape or positioning of combining characters.

391	3.2.  The discovery and the Arabic script cases

393	   While the set of problems with normalization discussed above was
394	   discovered with a newly-added code point for the Arabic Script and
395	   some characteristics of Unicode handling of that script seem to make
396	   the problem more complex going forward, these are not issues specific
397	   to Arabic.  This section describes the Arabic-specific problems;
398	   subsequent ones (starting with Section 3.3) discuss the problem more
399	   generally and include illustrations from other scripts.

401	3.2.1.  New code point U+08A1, decomposition, and language dependency

403	   Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
404	   WITH HAMZA ABOVE.  As can be deduced from the name, it is visually
405	   identical to the glyph that can be formed from a combining sequence
406	   consisting of the code point for ARABIC LETTER BEH (U+0628) and the
407	   code point for Combining Hamza Above (U+0654).  The two rules
408	   summarized above (see the last part of Section 3.1) suggest that
409	   either the new code point should not be allocated at all or that it
410	   should have a decomposition to \u'0628'\u'0654'.
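
   The mismatch can be demonstrated mechanically.  The following is a
   minimal sketch assuming Python's standard "unicodedata" module with
   Unicode 7.0.0 or later data (the version in use is available as
   "unicodedata.unidata_version"); the variable names are illustrative.

```python
import unicodedata

PRECOMPOSED = "\u08A1"    # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + HAMZA ABOVE (combining)

# Compare the two forms under each of the four normalization forms.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    left = unicodedata.normalize(form, PRECOMPOSED)
    right = unicodedata.normalize(form, SEQUENCE)
    results[form] = (left == right)

# U+08A1 has no decomposition mapping, so no form relates the two.
print(repr(unicodedata.decomposition(PRECOMPOSED)))
print(results)
```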

412	   Had the issues outlined in this document been better understood at
413	   the time, it probably would have been wise for RFC 5892 to disallow
414	   either the precomposed character or the combining sequence of each
415	   pair in those cases in which Unicode normalization rules do not cause
416	   the right thing to happen, i.e., the combining sequence and
417	   precomposed character to be treated as equivalent.  Failure to do so
418	   at the time places an extra burden on registries to be sure that
419	   conflicts (and the potential for confusion and attacks) do not exist.
420	   Oddly, had the exclusion been made part of the specification at that
421	   time, the preference for precomposed forms noted above would probably
422	   have dictated excluding the combining sequence, something not
423	   otherwise done in IDNA2008 because the NFC requirement serves the
424	   same purpose.  Today, the only thing that can be excluded without the
425	   potential disruption of disallowing a previously-PVALID combining
426	   sequence is the newly-added code point, so whatever is done, or
427	   might have been contemplated with hindsight, will be somewhat
428	   inconsistent.

430	3.2.2.  Other examples of the same behavior within the Arabic Script

432	   One of the things that complicates the issue with the new U+08A1 code
433	   point is that there are several other Arabic-script code points that
434	   behave in the same way for similar language-specific reasons.

436	   In particular, at least three other grapheme clusters that have been
437	   present for many versions of Unicode can be seen as involving issues
438	   similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
439	   ABOVE.  ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
440	   REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
441	   preferred over combining sequences using HAMZA ABOVE (U+0654)
442	   [Unicode70-Hamza].  By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
443	   (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH
444	   HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC
445	   LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654'
446	   so the precomposed character and combining sequences compare equal
447	   when both are normalized, as this specification prefers.
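
   The contrast between the two groups of code points named above can
   be listed from the character database.  This sketch assumes Python's
   standard "unicodedata" module; "canonical_decomposition" is a
   hypothetical helper that ignores compatibility mappings.

```python
import unicodedata
from typing import Optional


def canonical_decomposition(char: str) -> Optional[str]:
    """Return the canonical decomposition of one character, or None."""
    mapping = unicodedata.decomposition(char)
    # An empty mapping means no decomposition; a leading "<tag>" marks
    # a compatibility mapping, which is not canonical.
    if not mapping or mapping.startswith("<"):
        return None
    return "".join(chr(int(code, 16)) for code in mapping.split())


# Precomposed characters with HAMZA ABOVE discussed in the text.
for char in ("\u0623", "\u0624", "\u0626", "\u0681", "\u076C"):
    decomposed = canonical_decomposition(char)
    shown = "none" if decomposed is None else " ".join(
        "U+%04X" % ord(c) for c in decomposed)
    print("U+%04X -> %s" % (ord(char), shown))
```

   For the first three characters, the precomposed form and the
   combining sequence normalize to the same string; for the last two
   they do not.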

449	   There are other variations in which a precomposed character involving
450	   HAMZA ABOVE has a decomposition to a combining sequence that can form
451	   it.  For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
452	   compatibility decomposition, but not a canonical one, into the
453	   combining sequence \u'06C7'\u'0674'.
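
   Because the mapping for U+0677 is a compatibility mapping, it is
   applied by the compatibility normalization forms only.  A minimal
   sketch, assuming Python's standard "unicodedata" module:

```python
import unicodedata

U_WITH_HAMZA = "\u0677"    # ARABIC LETTER U WITH HAMZA ABOVE
SEQUENCE = "\u06C7\u0674"  # the sequence named in the text

# The raw mapping carries a "<compat>" tag rather than being canonical.
mapping = unicodedata.decomposition(U_WITH_HAMZA)
print(mapping)

# Canonical forms (NFD, and NFC as required by IDNA) leave it alone.
nfd = unicodedata.normalize("NFD", U_WITH_HAMZA)
nfc = unicodedata.normalize("NFC", U_WITH_HAMZA)

# Compatibility forms replace it with the two-code-point sequence.
nfkd = unicodedata.normalize("NFKD", U_WITH_HAMZA)

print(nfd == U_WITH_HAMZA, nfc == U_WITH_HAMZA, nfkd == SEQUENCE)
```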

455	3.2.3.  Hamza and Combining Sequences

457	   As the Unicode Standard points out at some length [Unicode70-Arabic],
458	   Hamza is a problematic abstract character and the "Hamza Above"
459	   construction even more so [Unicode70-Hamza].  Those sections explain
460	   a distinction made by Unicode between the use of a Hamza mark to
461	   denote a glottal stop and one used as a diacritic mark to denote a
462	   separate letter.  In the first case, the combining sequence is used.
463	   In the second, a precomposed character is assigned.

465	   Unlike Unicode generally and because of concerns about identifier
466	   spoofing and attacks based on similarities, character distinctions in
467	   IDNA are based much more strictly on the appearance of characters;
468	   language and pronunciation distinctions within a script are not
469	   considered.  So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-
470	   tautologically the same as BEH WITH HAMZA ABOVE, even if one of them
471	   is written as U+08A1 (new to Unicode 7.0.0) and the other as the
472	   sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also
473	   available in versions of Unicode going back at least to the version
474	   [Unicode32] used in the original version of IDNA [RFC3490]).  Because
475	   the precomposed form and combining sequence are, for IDNA purposes,
476	   the same, IDNA expects that normalization (specifically the
477	   requirement that all U-labels be in NFC form) will cause them to
478	   compare equal.

480	   If Unicode also considered them the same, then the principle would
481	   apply that new precomposed ("composition") forms are not added unless
482	   one of the code points that could be used to construct it did not
483	   exist in an earlier version (and even then is discouraged)
484	   [UAX15-Versioning].  When exceptions are made, they are expected to
485	   conform to the rules and classes in the "Composition Exclusion
486	   Table", with class 2 being relevant to this case [UAX15-Exclusion].
487	   That rule essentially requires that the normalization for the old
488	   combining sequence to itself be retained (for stability) but that the
489	   newly-added character be treated as canonically decomposable and
490	   decompose back to the older sequence even under NFC.  That was not
491	   done for this particular case, presumably because of the distinction
492	   about pronunciation modifiers versus separate letters noted above.
493	   Because, for IDNA and the DNS, there is a possibility that the
494	   composing sequence \u'0628'\u'0654' already appears in labels, the
495	   only choice other than allowing an otherwise-identical, and
496	   identically-appearing, label with U+08A1 substituted to identify a
497	   different DNS entry is to DISALLOW the new character.

499	3.3.  Precomposed characters without decompositions more generally

501	3.3.1.  Description of the general problem

503	   As mentioned above, IDNA made a strong assumption that, if there were
504	   two ways to form the same abstract character in the same script,
505	   normalization would result in them comparing equal.  Work on IDNA2008
506	   recognized that early versions of Unicode might also contain some
507	   inconsistencies; see Section 3.3.2.4 below.

509	   Having precomposed code points exist that don't have decompositions,
510	   or having them allocated in the future, is problematic for those IDNA
511	   assumptions about character comparison, and seems to call for
512	   excluding some set of code points that IDNA's rules do not now
513	   identify, developing and using a normalization procedure that behaves
514	   as expected (those two options may be nearly equivalent for many
515	   purposes), or deciding to accept a risk that, apparently, will only
516	   increase over time.

518	   It is not clear whether the reasons the IDNABIS WG did not understand
519	   and allow for these cases are important except insofar as they inform
520	   considerations about what to do in the future.  It seemed (and still
521	   seems to some people) that the Unicode Standard is very clear on the
522	   matter.  In addition to the normalization stability rules cited in
523	   the last part of Section 3.1, the discussion in the Core Standard
524	   seems quite clear.  For example, "Where characters are used in
525	   different ways in different languages, the relevant properties are
526	   normally defined outside the Unicode Standard" in Section 2.2,
527	   subsection titled "Semantics" [Unicode7] did not suggest to most
528	   readers that sometimes separate code points would be allocated within
529	   a script based on language considerations.  Similarly, the same
530	   section of the Standard says, in a subsection titled "Unification",
531	   "The Unicode Standard avoids duplicate encoding of characters by
532	   unifying them within scripts across language" and does not list
533	   exceptions to that rule or limit it to a single script although it
534	   goes on to list "CJK" as an example.  Another subsection, "Equivalent
535	   Sequences" indicates "Common precomposed forms ... are included for
536	   compatibility with current standards.  For static precomposed forms,
537	   the standard provides a mapping to an equivalent dynamically composed
538	   sequence of characters".  The latter appears to be precisely the "all
539	   precomposed characters decompose into the relevant combining
540	   sequences if the relevant base and combining characters exist in the
541	   Standard" that IDNA needs and assumed and, again, there is no mention
542	   of exceptions, language-dependent or otherwise.  The summary of
543	   stability policies cited in the Standard [Unicode70-Stability] does
544	   not appear to shed any additional light on these issues.

546	   The Standard now contains a subsection titled "Non-decomposition of
547	   Overlaid Diacritics" [Unicod70-Overlay] that identifies a list of
548	   diacritics that do not normally form characters that have
549	   decompositions.  The rule given has its own exceptions and the text
550	   clearly states that there is actually no way to know whether a code
551	   point has a decomposition other than consulting the Unicode Character
552	   Database entry for that code point.  The subsequent section notes
553	   that this can be a security problem; while the issues with IDNA go
554	   well beyond what is normally considered security, that comment now
555	   seems clear.  While that subsection is helpful in explaining the
556	   problem, especially for European scripts, it does not appear in the
557	   Unicode versions that were current when IDNA2008 was being developed.

559	3.3.2.  Latin Examples and Cases

561	   While this set of problems was discovered because of a code point
562	   added to the Arabic script in precombined form to support a
563	   particular language, there are actually far more examples for, e.g.,
564	   Latin script than there are for Arabic script.  Many of them are
565	   associated with the "non-decomposition of combining diacriticals"
566	   issues mentioned above, but the next subsections describe other cases
567	   that are not directly bound to decomposition.

569	3.3.2.1.  The font exclusion and compatibility relationships

571	   Unicode contains a large collection of characters that are identified
572	   as "Mathematical Symbols".  A large subset of them are basic or
573	   decorated Latin characters, differing from the ordinary ones only by
574	   their usage and, in appearance, by font or type styling (despite the
575	   general principle that font distinctions are not used as the basis
576	   for assigning separate code points).  Most have compatibility
577	   mappings to the base form, which eliminates them from IDNA, but
578	   others do not and, because the same marks that are used as phonetic
579	   diacritical markings in conventional alphabetical use have special
580	   mathematical meanings, applications that permit the use of these
581	   characters have their own issues with normalization and equality.

583	3.3.2.2.  The phonetic notation characters and extensions

585	   Another example involves various Phonetic Alphabet and Extension
586	   characters, many of which, unlike the Mathematical ones, do not have
587	   normalizations that would make them compare equal to the basic
588	   characters with essentially identical representations.  This would
589	   not be a problem for IDNA if they were identified with a specialized
590	   script or as symbols rather than letters, but neither is the case:
591	   they are generally identified as lower case Latin Script letters even
592	   when they are visually upper-case, another issue for IDNA.
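
   The contrast between the Mathematical and the phonetic characters
   can be illustrated with one code point from each group.  This is a
   sketch only, using Python's standard "unicodedata" module; the two
   code points were chosen for illustration and the helper name
   describe is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (General_Category, decomposition mapping, NFKC form)."""
    return (unicodedata.category(ch),
            unicodedata.decomposition(ch),
            unicodedata.normalize("NFKC", ch))

math_bold_a = "\U0001D400"   # MATHEMATICAL BOLD CAPITAL A
small_cap_a = "\u1D00"       # LATIN LETTER SMALL CAPITAL A

# The mathematical letter carries a <font> mapping back to "A".
print(describe(math_bold_a))

# The phonetic letter has no mapping and is classified as a
# lower-case letter even though it is visually a capital.
print(describe(small_cap_a))
```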

594	3.3.2.3.  Combining dots and other shapes combine... unless...

596	   The discussion of "Non-decomposition of Overlaid Diacritics"
597	   [Unicod70-Overlay] indirectly exhibits at least one reason why it has
598	   been difficult to characterize the problem.  If one combines that
599	   subsection with others, one gets a set of rules that might be
600	   described as:

602	   1.  If the precomposed character and the code points that make up the
603	       combining sequence exist, then canonical composition and
604	       decomposition work as expected, except...

606	   2.  If the precomposed character was added to Unicode after the code
607	       points that make up the combining sequence, normalization
608	       stability for the combining sequences requires that NFC applied
609	       to the precomposed character decomposes rather than having the
610	       combining sequence compose to the new character, however...

612	   3.  If the combining sequence involves a diacritic or other mark that
613	       actually touches the base character when composed, the
614	       precomposed character does not have a decomposition, unless...

616	   4.  The combining diacritic involved is Cedilla (U+0327), Ogonek
617	       (U+0328), or Horn (U+031B), in which case the precomposed
618	       characters that contain them "regularly" (but presumably not
619	       always) do have decompositions, and...

621	   5.  There are further exceptions for Hamza (which does not overlay
622	       the associated base character in the same way the Latin-derived
623	       combining diacritics and other marks do).  Those decisions to
624	       decompose a precomposed character (or not) are based on language
625	       or phonetic considerations, not the combining mechanism or
626	       appearance, or perhaps,...

628	   6.  Some characters have compatibility decompositions rather than
629	       canonical ones [Unicod70-CompatDecomp].  Because compatibility
630	       relationships are treated differently by IDNA, PRECIS
631	       [PRECIS-Framework], and, potentially, other protocols involving
632	       identifiers for Internet use, the existence of a compatibility
633	       relationship may or may not be helpful.  Finally,...

635	   7.  There is no reason to believe the above list is complete.  In
636	       particular, if whether a precomposed character decomposes or not
637	       is determined by language or phonetic distinctions, one may need
638	       additional rules on a per-script and/or per-character basis.

640	   The above list only covers the cases involving combining sequences.
641	   It does not cover cases such as those in Section 3.3.2.1 and
642	   Section 3.3.2.2 and there may be additional groups of cases not yet
643	   identified.
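
   Several of the rules above can be observed with a handful of code
   points.  The following sketch uses Python's standard "unicodedata"
   module; the sample characters and the helper names are chosen for
   illustration and are not drawn from the Unicode Standard's own
   discussion.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def composes_to(sequence, precomposed):
    """True if NFC turns the combining sequence into the precomposed form."""
    return nfc(sequence) == precomposed

def survives_nfc(ch):
    """True if NFC leaves the precomposed character unchanged."""
    return nfc(ch) == ch

# Rule 1: ordinary composition (e + COMBINING ACUTE ACCENT).
print(composes_to("e\u0301", "\u00E9"))

# Rule 2: U+2ADC FORKING is a composition exclusion, so NFC
# decomposes it instead of composing the sequence.
print(survives_nfc("\u2ADC"), nfc("\u2ADC") == "\u2ADD\u0338")

# Rule 3: an overlaid stroke; LATIN SMALL LETTER O WITH STROKE has
# no decomposition, so the sequence with U+0338 never matches it.
print(composes_to("o\u0338", "\u00F8"))

# Rule 4: Cedilla, Ogonek, and Horn do compose.
print(composes_to("c\u0327", "\u00E7"),
      composes_to("a\u0328", "\u0105"),
      composes_to("o\u031B", "\u01A1"))
```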

645	3.3.2.4.  "Legacy" characters and new additions

647	   The development of categories and rules for IDNA recognized that
648	   early versions of Unicode might contain some inconsistencies if
649	   evaluated using more contemporary rules about code point assignments
650	   and stability.  In particular, there might be some exceptions from
651	   different practices in early versions of Unicode or anomalies caused
652	   by copying existing single- or dual-script standards into Unicode as
653	   blocks rather than individual character additions to the repertoire.
654	   The possibility of such "legacy" exceptions was one reason why the
655	   IDNA category rules include explicit provisions for exception lists
656	   (even though no such code points were identified prior to 2014).

658	3.3.3.  Examples and Cases from Other Scripts

660	   Research into these issues has not yet turned up a comprehensive list
661	   of affected scripts and code points.  As discussed elsewhere in this
662	   document, it is clear that Arabic and Latin Scripts are significantly
663	   affected, that some Han and Kangxi radicals and ideographs are
664	   affected, and that other examples do exist -- it is just not known
665	   how many of those examples there are and what patterns, if any,
666	   characterize them.

668	3.3.4.  Scripts with precomposed preferences and ones with combining
669	        preferences

671	   While the authors have been unable to find an explanation for the
672	   differentiation in the Unicode Standard, we have been told that there
673	   are differences among scripts as to whether the action preference is
674	   to add new combining sequences only (and resist adding precomposed
675	   characters) as suggested in Section 3.3.2.3 or to add precomposed
676	   characters, often ones that do not have decompositions.  If those
677	   differences in preference do exist, it is probably important to have
678	   them documented so that they can be reflected in IDNA review
679	   procedures and elsewhere.  It will also require IETF discussion of
680	   whether combining sequences should be deprecated when the
681	   corresponding precomposed characters are added or disallowed
682	   entirely for those scripts (as has been implicitly suggested for
683	   Arabic language use [RFC5564]).

685	   [[CREF1: The above isn't quite right and probably needs additional
686	   discussion and text.]]

688	3.4.  Confusion and the casual user

690	   To the extent to which predictability for relatively casual users is
691	   a desired and important feature of relevant applications or
692	   application support protocols, it is probably worth observing that
693	   the complex of rules and cases above is almost certainly too involved
694	   for the typical such user to develop a good intuitive understanding
695	   of how things behave and what relationships exist.

697	4.  Implementation options and issues: Unicode properties, exceptions,
698	    and the nature of stability

700	4.1.  Unicode Stability compared to IETF (and ICANN) Stability

702	   The various stability rules in Unicode [Unicode70-Stability] all
703	   appear to be based on the model that once a value is assigned, it can
704	   never be changed.  That is probably appropriate for a character
705	   coding system with multiple uses and applications.  It is probably
706	   the only option when normative relationships are expressed in tables
707	   of values rather than by rules.  One consequence of such a model is
708	   that it is difficult or impossible to fix mistakes (for some
709	   stability rules, the Unicode Standard does provide for exceptions)
710	   and even harder to make adjustments that would normally be dictated
711	   by evolution.

713	   "No changes" provides a very strong and predictable type of stability
714	   and there are many reasons to take that path.  As in some of the
715	   cases that motivated this document, the difficulty is that simply
716	   adding new code points (in Unicode) or features (in a protocol or
717	   application) may be destabilizing.  One then has complete stability
718	   for systems that never use or allow the new code points or features,
719	   but rough edges for newer systems that see the discrepancies.
720	   IDNA2003 (inadvertently) took that approach by freezing
721	   on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever
722	   been allowed, we would have had complete stability even as Unicode
723	   libraries changed.  Unicode has been quite ingenious about working
724	   around those difficulties with such provisions as having code points
725	   for newly-added precomposed characters decompose rather than altering
726	   the normalization for the combining sequences.  Other cases, such as
727	   newly-added precomposed characters that do not decompose for, e.g.,
728	   language or phonetic reasons, are more problematic.

730	   The IETF (and ICANN and standards development bodies such as ISO and
731	   ISO/IEC JTC1) have generally adopted a different type of stability
732	   model, one which considers experience in use and the ill effects of
733	   not making changes as well as the disruptive effects of doing so.  In
734	   the IETF model, if an earlier decision is causing sufficient harm and
735	   there is consensus in the communities that are most affected that a
736	   change is desirable enough to make transition costs acceptable, then
737	   the change is made.

739	   The difference and its implications are perhaps best illustrated by a
740	   disagreement when IDNA2008 was being approved.  IDNA2003 had
741	   effectively prevented some characters, notably (measured by intensity
742	   of the protests) the Sharp S character (U+00DF) from being used in
743	   DNS labels by mapping them to other characters before conversion to
744	   ACE form.  It had also prohibited some other code points, notably ZWJ
745	   (U+200D) and ZWNJ (U+200C), by discarding them.  In both cases, there
746	   were strong voices from the relevant language communities, supported
747	   by the registry communities, that the characters were important
748	   enough that it was more desirable to undergo the short-term pain of a
749	   transition and some uncertainty than to continue to exclude those
750	   characters, and the IDNA2008 rules and repertoire are consistent with
751	   that preference.  The Unicode Consortium apparently believed that
752	   stability --elimination of any possibility of label invalidation or
753	   different interpretations of the same string-- was more important
754	   than those writing system requirements and community preferences.
755	   That view was expressed through what was effectively a fork in (or
756	   attempt to nullify) the IETF Standard [UTS46], a result that has
757	   probably been worse for the overall Internet than either of the
758	   possible decision choices.

760	4.2.  New Unicode Properties

762	   One suggestion about the way out of these problems would be to create
763	   one or more new Unicode properties, maintained along with the rest of
764	   Unicode, and then incorporated into new or modified rules or
765	   categories in IDNA.  Given the analysis in this document, it appears
766	   that that property (or properties) would need to provide:

768	   1.  Identification of combining characters that, when used in
769	       combining sequences, do not produce decomposable characters.
770	       [[CREF2: Wording on the above is not quite right but, for the
771	       present, maybe the intent is clear.]]

773	   2.  Identification of precomposed characters that might reasonably be
774	       expected to decompose, but that do not.

776	   3.  Identification of character forms that are distinct only because
777	       of language or phonetic distinctions within a script.

779	   4.  Identification of scripts for which precomposed forms are
780	       strongly preferred and combining sequences should either be
781	       viewed as temporary mechanisms until precomposed characters are
782	       assigned or banned entirely.

784	   5.  Identification of code points that represent symbols for
785	       specific, non-language, purposes even if identified as letters or
786	       numerals by their General Property (see Section 3.3.2.2 and
787	       Section 3.3.2.1).

789	   Some of these properties (or characteristics or values of a single
790	   property) would be suitable for disallowing characters, code points,
791	   or contextual sequences that otherwise might be allowed by IDNA.
792	   Others would be more suitable for making equality comparisons come
793	   out as needed by IDNA, particularly to eliminate distinctions based
794	   on language context.

796	   While it would appear that appropriate rules and categories could be
797	   developed for IDNA (and, presumably, for PRECIS, etc.) if the problem
798	   areas are those identified in this document, it is not yet known
799	   whether the list is complete (and, hence, whether additional
800	   properties or information would be needed).

802	   Even with such properties, IDNA would still almost certainly need
803	   exception lists.  In addition, it is likely that stability rules for
804	   those properties would need to reflect IETF norms with arrangements
805	   for bringing the IETF and other communities into the discussion when
806	   tradeoffs are reviewed.

808	4.3.  The need for exception lists

810	   [[CREF3: Note in draft: this section is a partial placeholder and may
811	   need more elaboration.]]
812	   Issues with exception lists and the requirements for them are
813	   discussed in Section 2 above and RFC 5894 [RFC5894].

815	5.  Proposed/Alternative Changes to RFC 5892 for the issues first
816	    exposed by new code point U+08A1

818	   NOTE IN DRAFT: See the comments in the Introduction, Section 1 and
819	   the first paragraph of each Subsection below for the status of the
820	   Subsections that follow.  Each one, in combination with the material
821	   in Section 3 above, also provides information about the reasons why
822	   that particular strategy might or might not be appropriate.

824	5.1.  Disallow This New Code Point

826	   This option is almost certainly too Arabic-specific and does not
827	   solve, or even address, the underlying problem.  It also does not
828	   inherently generalize to non-decomposing precomposed code points that
829	   might be added in the future (whether to Arabic or other scripts)
830	   even though one could add more code points to Category F in the same
831	   way.

833	   If chosen by the community, this subsection would update the portion
834	   of the IDNA2008 specification that identifies rules for what
835	   characters are permitted [RFC5892] to disallow that code point.

837	   With the publication of this document, Section 2.6 ("Exceptions (F)")
838	   of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
839	   Category F so that the rule itself reads:

841	      F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
842	                   0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
843	                   0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
844	                   06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
845	                   3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
846	                   303B, 30FB}

848	   and then add to the subtable designated
849	   "DISALLOWED -- Would otherwise have been PVALID"
850	   after the line that begins "07FA", the additional line:

852	      08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE

854	   This has the effect of making the cited code point DISALLOWED
855	   independent of application of the rest of the IDNA rule set to the
856	   current version of Unicode.  Those wishing to create domain name
857	   labels containing Beh with Hamza Above may continue to use the
858	   sequence

860	      U+0628, ARABIC LETTER BEH
861	      followed by

863	      U+0654, ARABIC HAMZA ABOVE

865	   which was valid for IDNA purposes in Unicode 5.0 and earlier and
866	   which continues to be valid.
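
   The effect of the updated rule can be sketched as a simple set
   lookup.  The code below is an illustration under stated assumptions,
   not an IDNA implementation: the set reproduces the Category F list
   as given above, the value table holds only the one entry this
   subsection adds, and the function names are hypothetical.

```python
# Category F as listed above, including the proposed addition 08A1.
CATEGORY_F = frozenset(
    [0x00B7, 0x00DF, 0x0375, 0x03C2, 0x05F3, 0x05F4, 0x0640]
    + list(range(0x0660, 0x066A))     # 0660..0669
    + list(range(0x06F0, 0x06FA))     # 06F0..06F9
    + [0x06FD, 0x06FE, 0x07FA, 0x08A1, 0x0F0B, 0x3007, 0x302E, 0x302F]
    + list(range(0x3031, 0x3036))     # 3031..3035
    + [0x303B, 0x30FB]
)

# Only the entry added by this subsection; the remaining values are in
# the RFC 5892 Section 2.6 table and are deliberately not repeated here.
EXCEPTION_VALUES = {0x08A1: "DISALLOWED"}

def in_category_f(cp):
    """True if the code point is matched by the Category F rule."""
    return cp in CATEGORY_F

def label_uses_disallowed_exception(label):
    """True if any character of the label is DISALLOWED by the table."""
    return any(EXCEPTION_VALUES.get(ord(ch)) == "DISALLOWED" for ch in label)

print(label_uses_disallowed_exception("\u08A1"))         # precomposed form
print(label_uses_disallowed_exception("\u0628\u0654"))   # combining sequence
```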

868	   In principle, much the same thing could be accomplished by using the
869	   IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892
870	   Section 5.3).  However, that category is described as applying only
871	   when "property values in versions of Unicode after 5.2 have changed
872	   in such a way that the derived property value would no longer be
873	   PVALID or DISALLOWED".  Because U+08A1 is a newly-added code point in
874	   Unicode 7.0.0 and no property values of code points in prior versions
875	   have changed, category G does not apply.  If that section of RFC 5892
876	   were to be replaced in the future, perhaps consideration should be
877	   given to adding Normalization Stability and other issues to that
878	   description but, at present, it is not relevant.

880	5.2.  Disallow This New Code Point and All Future Precomposed Additions
881	      that do not decompose

883	   At least in principle, the approach suggested above (Section 5.1)
884	   could be expanded to disallow all future allocations of non-
885	   decomposing precomposed characters.  This would probably require
886	   either a new Unicode property to identify such characters and/or more
887	   emphasis on the manual, individual code point, checking of the new
888	   Unicode version review process (i.e., not just application of the
889	   existing rules and algorithm).  It might require either a new rule in
890	   IDNA or a modification to the structure of Category F to make
891	   additions less tedious.  It would do nothing for different ways to
892	   form identical characters within the same script that were not
893	   associated with decomposition and so would have to be used in
894	   conjunction with other approaches.  Finally, for scripts (such as
895	   Arabic) where there is a very strong preference to avoid combining
896	   sequences, this approach would exclude exactly the wrong set of
897	   characters.

899	5.3.  Disallow the combining sequences for these characters

901	   As in the approach discussed in Section 5.1, this approach is too
902	   Arabic-specific to address the more general problem.  However, it
903	   illustrates a single-script approach and a possible mechanism for
904	   excluding combining sequences whose handling is connected to language
905	   information (information that, as discussed above, is not relevant to
906	   the DNS).

908	   If chosen by the community, this subsection would update the portion
909	   of the IDNA2008 specification that identifies contextual rules
910	   [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction
911	   with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631).  Note that
912	   the choice of this option is consistent with the general preference
913	   for precomposed characters discussed above but would ban some labels
914	   that are valid today and that might, in principle, be in use.

916	   The required prohibition could be imposed by creating a new
917	   contextual rule in RFC 5892 to constrain combining sequences
918	   containing Hamza Above.
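
   As a rough illustration of what such a prohibition would check, the
   sketch below rejects a label in which Hamza Above immediately
   follows one of the three base letters named above.  The function
   and constant names are hypothetical, and the check is deliberately
   simplistic: it does not handle other combining marks that might
   appear between the base letter and Hamza Above.

```python
HAMZA_ABOVE = "\u0654"
# ARABIC LETTER BEH, HAH, and REH, as listed in this subsection.
RESTRICTED_BASES = {"\u0628", "\u062D", "\u0631"}

def violates_hamza_rule(label):
    """True if Hamza Above directly follows BEH, HAH, or REH."""
    for i, ch in enumerate(label):
        if ch == HAMZA_ABOVE and i > 0 and label[i - 1] in RESTRICTED_BASES:
            return True
    return False

print(violates_hamza_rule("\u0628\u0654"))   # BEH + HAMZA ABOVE
print(violates_hamza_rule("\u08A1"))         # precomposed U+08A1
print(violates_hamza_rule("\u0627\u0654"))   # ALEF + HAMZA ABOVE
```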

920	   As the Unicode Standard points out at some length [Unicode70-Arabic],
921	   Hamza is a problematic abstract character and the "Hamza Above"
922	   construction even more so.  IDNA has historically associated
923	   characters whose use is reasonable in some contexts but not others
924	   with the special derived property "CONTEXTO" and then specified
925	   specific, context-dependent, rules about where they may be used.
926	   Because Hamza Above is problematic (and spawns edge cases, as
927	   discussed in the Unicode Standard section cited above), it was
928	   suggested that a contextual rule might be appropriate.  There are at
929	   least two reasons why a contextual rule would not be suitable for the
930	   present situation.

932	   1.  As discussed above, the present situation is a normalization
933	       stability and predictability problem, not a contextual one.  Had
934	       the same issues arisen with a newly-added precomposed character
935	       that could previously be constructed from non-problematic base
936	       and combining characters, it would be even more clearly a
937	       normalization issue and, following the principles discussed there
938	       and particularly in UAX 15 [UAX15-Exclusion], might not have been
939	       assigned at all.

941	   2.  The contextual rule sets are designed around restricting the use
942	       of code points to a particular script or adjacent to particular
943	       characters within that script.  Neither of these cases applies to
944	       the newly-added character even if one could imagine rules for the
945	       use of Hamza Above (U+0654) that would reflect the considerations
946	       of Chapter 8 of Unicode 6.2.  Even had the latter been desired,
947	       it would be somewhat late now -- Hamza Above has been present as
948	       a combining character (U+0654) in many versions of Unicode.
949	       While that section of the Unicode Standard describes the issues,
950	       it does not provide actionable guidance about what to do about it
951	       for cases going forward or when visual identity is important.

953	5.4.  Disallow all Combining Characters for Specific Scripts

955	   [[CREF4: This subsection needs to be turned into prose, but the
956	   following bullet points are probably sufficient to identify the
957	   issues.]]

959	   Might work for Arabic and other "precomposed preference" scripts (see
960	   Section 3.3.4; recommended by the Arabic language community for IDNs
961	   [RFC5564]).  Hopeless for Latin.  Backwards incompatible.  No effect
962	   at all on special-use representations of identical characters within
963	   a script (see Section 3.3.2.1 and Section 3.3.2.2).

965	5.5.  Do Nothing Other Than Warn

967	   The recommendation from UTC is to simply warn registries, at all
968	   levels of the tree, to be careful with this set of characters, making
969	   language distinctions within zones.  Because the DNS cannot make or
970	   enforce language distinctions, this suggestion is problematic but it
971	   would avoid having the IETF either invalidate label strings that
972	   are potentially now in use or create inconsistencies among the
973	   characters that combine with Hamza Above but that also have
974	   precomposed forms that do not have decompositions.  The potential
975	   would still exist for registries to respect the warning and deprecate
976	   such labels if they existed.

978	5.6.  Normalization Form IETF (NFI)

980	   The most radical possibility for the comparison issue would be to
981	   decide that none of the Unicode Normalization Forms specified in UAX
982	   15 [UAX15] are adequate for use with the DNS because, contrary to
983	   their apparent descriptions, normalization tables are actually
984	   determined using language information.  However, use of language
985	   information is unacceptable for IDNA for reasons described elsewhere
986	   in this document.  The remedy would be to define an IETF-specific (or
987	   DNS-specific) normalization form (sometimes called "NFI" in
988	   discussions), building on NFC but adhering strictly to the rule that
989	   normalization causes two different forms of the same character (glyph
990	   image) within the same script to be treated as equal.  In practice
991	   such a form could be implemented for IDNA purposes as an additional
992	   rule within RFC 5892 (and its successors) that constituted an
993	   exception list for the NFC tables.  For this set of characters, the
994	   special IETF normalization form would be equivalent to the exclusion
995	   discussed in Section 5.3 above.
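
   One way to picture NFC plus an exception list is the sketch below.
   Everything in it other than the call to Python's standard
   "unicodedata.normalize" is an assumption made for illustration: the
   name nfi, the contents of the exception table, and the choice to
   compose toward the precomposed characters (consistent with
   Section 5.3) are not specified anywhere.  The simple substring
   replacement also ignores combining marks that canonical ordering
   might place between the base letter and Hamza Above.

```python
import unicodedata

# Hypothetical exception list: combining sequence -> precomposed form.
NFI_EXCEPTIONS = {
    "\u0628\u0654": "\u08A1",   # BEH + HAMZA ABOVE
    "\u062D\u0654": "\u0681",   # HAH + HAMZA ABOVE
    "\u0631\u0654": "\u076C",   # REH + HAMZA ABOVE
}

def nfi(s):
    """Apply NFC, then the additional compositions in the exception list."""
    out = unicodedata.normalize("NFC", s)
    for sequence, precomposed in NFI_EXCEPTIONS.items():
        out = out.replace(sequence, precomposed)
    return out

# Under this sketch the two forms of BEH WITH HAMZA ABOVE compare equal.
print(nfi("\u08A1") == nfi("\u0628\u0654"))
```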

997	   An Internet-specific normalization form, especially if specified
998	   somewhat separately from the IDNA core, would have a small marginal
999	   advantage over the other strategies in this section (or in
1000	   combination with some of them), even though most of the end result
1001	   and much of the implementation would be the same in practice.  While
1002	   the design of IDNA requires that strings be normalized as part of the
1003	   process of determining label validity (and hence before either
1004	   storage of values in the DNS or name resolution), there is an ongoing
1005	   debate about whether normalization should be performed before storing
1006	   a string or putting it on the wire or only when the string is
1007	   actually compared or otherwise used.

1009	   If a normalization procedure with the right properties for the IETF
1010	   were defined, that argument could be bypassed and the best decisions
1011	   made for different circumstances.  The separation would also allow
1012	   better comparison of strings that lack language context in
1013	   application environments in which the additional processing and
1014	   character classifications of IDNA and/or PRECIS were not applicable.
1015	   Having such a normalization procedure defined outside IDNA would also
1016	   minimize changes to IDNA itself, which is probably an advantage.

1018	   If the new normalization form were, in practice, simply an overlay on
1019	   NFC with modifications dictated by exception and/or property lists,
1020	   keeping its definition separate from IDNA would also avoid
1021	   interweaving those exceptions and property lists with the rules and
1022	   categories of IDNA itself, avoiding some unnecessary complexity.

1024	6.  Editorial clarification to RFC 5892

1026	   Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
1027	   clarification to Appendix A and Section A.1 of RFC 5892.  This
1028	   section of this document updates the RFC to apply that clarification.

1030	   1.  In Appendix A, add a new paragraph after the paragraph that
1031	       begins "The code point...".  The new paragraph should read:

1033	       "For the rule to be evaluated to True for the label, it MUST be
1034	       evaluated separately for every occurrence of the Code point in
1035	       the label; each of those evaluations must result in True."

1037	   2.  In Appendix A, Section A.1, replace the "Rule Set" by

1039	      Rule Set:
1040	        False;
1041	        If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
1042	        If cp .eq. \u200C And
1043	               RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
1044	          (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
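
The effect of the two changes above (evaluate the rule at every occurrence, and anchor the regular expression at the code point being tested) can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the tiny `JOINING_TYPE` table and the function names are hypothetical, and a real implementation would load Joining_Type values from the Unicode Character Database. The Canonical_Combining_Class value of 9 for Virama comes from the Unicode data exposed by `unicodedata.combining`.

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

# Hypothetical, deliberately tiny Joining_Type table (illustration only).
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u064E": "T",  # ARABIC FATHA (transparent)
}

def joining_type(ch: str) -> str:
    # Anything missing from the toy table is treated as non-joining.
    return JOINING_TYPE.get(ch, "U")

def zwnj_rule_at(label: str, i: int) -> bool:
    """Evaluate the rule set for the single code point at index i."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    if label[i] != ZWNJ:
        return False
    # (Joining_Type:{L,D})(Joining_Type:T)* immediately before cp
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # (Joining_Type:T)*(Joining_Type:{R,D}) immediately after cp
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")

def zwnj_rule_for_label(label: str) -> bool:
    """True only if the rule holds at every occurrence of U+200C."""
    return all(zwnj_rule_at(label, i)
               for i, ch in enumerate(label) if ch == ZWNJ)
```

With this sketch, a label in which one U+200C is in a valid context and another is not evaluates to False as a whole, which is the behavior the added paragraph requires.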

1046	7.  Acknowledgements

1048	   The Unicode 7.0.0 changes were extensively discussed within the IAB's
1049	   Internationalization Program.  The authors are grateful for the
1050	   discussions and feedback there, especially from Andrew Sullivan and
1051	   David Thaler.  Additional information was requested and received from
1052	   Mark Davis and Ken Whistler.  While they probably do not agree with
1053	   the necessity of excluding this code point or taking even more
1054	   drastic action, as their responsibility is to look at the Unicode
1055	   Consortium requirements for stability, the decision would not have
1056	   been possible without their input.  Thanks to Bill McQuillan and Ted
1057	   Hardie for reading versions of the document carefully enough to
1058	   identify and report some confusing typographical errors.  Several
1059	   experts and reviewers who prefer to remain anonymous also provided
1060	   helpful input and comments on preliminary versions of this document.

1062	8.  IANA Considerations

1064	   When the IANA registry and tables are updated to reflect Unicode
1065	   7.0.0, changes should be made according to the decisions the IETF
1066	   makes about Section 5.

1068	9.  Security Considerations

1070	   From at least one point of view, this document is entirely a
1071	   discussion of a security issue or set of such issues.  The
1072	   "similar-looking characters" issue has been a concern since the
1073	   earliest days of IDNs [HomographAttack] and has driven assorted
1074	   "character confusion" projects [ICANN-VIP].  The issue here goes
1075	   further: if a user types a string on one device and gets a result
1076	   that does not compare equal to the result of typing it on a
1077	   different device (with both behaving correctly and both keyboards
1078	   appearing to be the same and for the same script), then all security
1079	   mechanisms that depend on the underlying identifiers, including the
1080	   practical applications of DNS response integrity checks (DNSSEC
1081	   [RFC4033]) and DNS-embedded public key mechanisms [RFC6698], are at
1082	   risk if different parties, at least one of them malicious, obtain
1083	   some of the identical-appearing and identically-typed strings.
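
To make the risk concrete, the short Python sketch below encodes two spellings of the same visible letter as DNS labels. It uses only the standard library's Punycode codec and adds the ACE prefix by hand; the helper name `to_a_label_sketch` is hypothetical, and the function deliberately skips all IDNA validity checks, so it shows the wire form only and is not an IDNA implementation.

```python
# Two ways to write what is rendered as the same Arabic letter.
precomposed = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

def to_a_label_sketch(u_label: str) -> str:
    """Punycode-encode a label and add the ACE prefix.

    This skips every IDNA validity check; it only shows the wire form
    that a lookup for the label would use.
    """
    return "xn--" + u_label.encode("punycode").decode("ascii")

a1 = to_a_label_sketch(precomposed)
a2 = to_a_label_sketch(sequence)
print(a1, a2, a1 == a2)
```

Because the two encoded labels differ, they can be registered to different parties, and any signature or key record published under one says nothing about the other.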

1085	   Mechanisms that depend on trusting registration systems (e.g.,
1086	   registries and registrars in the DNS IDN case, see Section 5.5 above)
1087	   are likely to be of only limited utility because fully-qualified
1088	   domains that may be perfectly reasonable at the first level or two of
1089	   the DNS may have differences of this type deep in the tree, into
1090	   levels where name management is weak.  Similar issues obviously apply
1091	   when names are user-selected or unmanaged.

1093	   When the issue is not a deliberate attack but simple accidental
1094	   confusion among similar strings, most of our strategies depend on the
1095	   acceptability of false negatives on matching if there is low risk of
1096	   false positives (see, for example, the discussion of false negatives
1097	   in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]).
1098	   Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and
1099	   the PRECIS effort [PRECIS-Framework].  But, because the cases covered
1100	   here are connected, not just to what the user sees but to what is
1101	   typed and where, there is an increased risk of false positives
1102	   (accidental as well as deliberate).

1104	   [[CREF5: Note in Draft: The paragraph that follows was written for a
1105	   much earlier version of this document.  It is obsolete, but is being
1106	   retained as a placeholder for future developments.]]
1107	   This specification excludes a code point for which the Unicode-
1108	   specified normalization behavior could result in two ways to form a
1109	   visually-identical character within the same script not comparing
1110	   equal.  That behavior could create a dream case for someone intending
1111	   to confuse the user by use of a domain name that looked identical to
1112	   another one, was entirely in the same script, but was still
1113	   considered different.

1115	   Internet Security in areas that involve internationalized identifiers
1116	   that might contain the relevant characters is therefore significantly
1117	   dependent on some effective resolution for the issues identified in
1118	   this document, not just hand waving, devout wishes, or appointment of
1119	   study committees about it.

1121	10.  References

1123	10.1.  Normative References

1125	   [PRECIS-Framework]
1126	              Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
1127	              Preparation, Enforcement, and Comparison of
1128	              Internationalized Strings in Application Protocols",
1129	              February 2015, <https://datatracker.ietf.org/doc/draft-
1130	              ietf-precis-framework/>.

1132	   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters", BCP
1133	              137, RFC 5137, February 2008.

1135	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1136	              Applications (IDNA): Definitions and Document Framework",
1137	              RFC 5890, August 2010.

1139	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
1140	              Internationalized Domain Names for Applications (IDNA)",
1141	              RFC 5892, August 2010.

1143	   [RFC5892Erratum]
1144	              "RFC5892, "The Unicode Code Points and Internationalized
1145	              Domain Names for Applications (IDNA)", August 2010, Errata
1146	              ID: 3312", Errata ID 3312, August 2012,
1147	              <http://www.rfc-editor.org/errata_search.php?rfc=5892>.

1149	   [RFC5894]  Klensin, J., "Internationalized Domain Names for
1150	              Applications (IDNA): Background, Explanation, and
1151	              Rationale", RFC 5894, August 2010.

1153	   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
1154	              Purposes", RFC 6943, May 2013.

1156	   [UAX15]    Davis, M., Ed., "Unicode Standard Annex #15: Unicode
1157	              Normalization Forms", June 2014,
1158	              <http://www.unicode.org/reports/tr15/>.

1160	   [UAX15-Exclusion]
1161	              "Unicode Standard Annex #15: op. cit., Section 5",
1162	              <http://www.unicode.org/reports/
1163	              tr15/#Primary_Exclusion_List_Table>.

1165	   [UAX15-Versioning]
1166	              "Unicode Standard Annex #15, op. cit., Section 3",
1167	              <http://www.unicode.org/reports/tr15/#Versioning>.

1169	   [UTS46]    Davis, M. and M. Suignard, "Unicode Technical Standard
1170	              #46: Unicode IDNA Compatibility Processing", Version
1171	              7.0.0, June 2014, <http://unicode.org/reports/tr46/>.

1173	   [Unicod70-CompatDecomp]
1174	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1175	              2.3: Compatibility Characters", Chapter 2, 2014,
1176	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1178	              Subsection titled "Compatibility Decomposable Characters"
1179	              starting on page 26.

1181	   [Unicod70-Overlay]
1182	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1183	              2.2: Unicode Design Principles", Chapter 2, 2014,
1184	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1186	              Subsection titled "Non-decomposition of Overlaid
1187	              Diacritics" starting on page 64.

1189	   [Unicode5]
1190	              The Unicode Consortium, "The Unicode Standard, Version
1191	              5.0", ISBN 0-321-48091-0, 2007.

1193	              Boston, MA, USA: Addison-Wesley.  ISBN 0-321-48091-0.
1194	              This printed reference has now been updated online to
1195	              reflect additional code points.  For code points, the
1196	              reference at the time RFC 5890-5894 were published is to
1197	              Unicode 5.2.

1199	   [Unicode62]
1200	              The Unicode Consortium, "The Unicode Standard, Version
1201	              6.2.0", ISBN 978-1-936213-07-8, 2012,
1202	              <http://www.unicode.org/versions/Unicode6.2.0/>.

1204	              Preferred citation: The Unicode Consortium.  The Unicode
1205	              Standard, Version 6.2.0, (Mountain View, CA: The Unicode
1206	              Consortium, 2012.  ISBN 978-1-936213-07-8)

1208	   [Unicode7]
1209	              The Unicode Consortium, "The Unicode Standard, Version
1210	              7.0.0", ISBN 978-1-936213-09-2, 2014,
1211	              <http://www.unicode.org/versions/Unicode7.0.0/>.

1213	              Preferred Citation: The Unicode Consortium.  The Unicode
1214	              Standard, Version 7.0.0, (Mountain View, CA: The Unicode
1215	              Consortium, 2014.  ISBN 978-1-936213-09-2)

1217	   [Unicode70-Arabic]
1218	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1219	              9.2: Arabic", Chapter 9, 2014,
1220	              <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.

1222	              Subsection titled "Encoding Principles", paragraph
1223	              numbered 4, starting on page 362.

1225	   [Unicode70-Design]
1226	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1227	              2.2: Unicode Design Principles", Chapter 2, 2014,
1228	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1230	   [Unicode70-Hamza]
1231	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1232	              9.2: Arabic", Chapter 9, 2014,
1233	              <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.

1235	              Subsection titled "Combining Hamza Above" starting on page
1236	              378.

1238	   [Unicode70-Stability]
1239	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1240	              2.2: Unicode Design Principles", Chapter 2, 2014,
1241	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1243	              Subsection titled "Stability" starting on page 23 and
1244	              containing a link to http://www.unicode.org/policies/
1245	              stability_policy.html.

1247	10.2.  Informative References

1249	   [Dalby]    Dalby, A., "Dictionary of Languages: The definitive
1250	              reference to more than 400 languages", Columbia University
1251	              Press, 2004.

1253	              pages 206-207

1255	   [Daniels]  Daniels, P. and W. Bright, "The World's Writing Systems",
1256	              Oxford University Press, 1986.

1258	   [HomographAttack]
1259	              Gabrilovich, E. and A. Gontmakher, "The Homograph Attack",
1260	              Communications of the ACM 45(2):128, February 2002,
1261	              <http://www.cs.technion.ac.il/~gabr/papers/
1262	              homograph_full.pdf>.

1264	   [ICANN-VIP]
1265	              ICANN, "The IDN Variant Issues Project: A Study of Issues
1266	              Related to the Management of IDN Variant TLDs (Integrated
1267	              Issues Report)", February 2012,
1268	              <https://www.icann.org/en/system/files/files/idn-vip-
1269	              integrated-issues-final-clean-20feb12-en.pdf>.

1271	   [Omniglot-Fula]
1272	              Ager, S., "Omniglot: Fula (Fulfulde, Pulaar,
1273	              Pular'Fulaare)",
1274	              <http://www.omniglot.com/writing/fula.htm>.

1276	              Captured 2015-01-07

1278	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
1279	              "Internationalizing Domain Names in Applications (IDNA)",
1280	              RFC 3490, March 2003.

1282	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1283	              Resource Identifier (URI): Generic Syntax", STD 66, RFC
1284	              3986, January 2005.

1286	   [RFC4033]  Arends, R., Austein, R., Larson, M., Massey, D., and S.
1287	              Rose, "DNS Security Introduction and Requirements", RFC
1288	              4033, March 2005.

1290	   [RFC5564]  El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
1291	              "Linguistic Guidelines for the Use of the Arabic Language
1292	              in Internet Domains", RFC 5564, February 2010.

1294	   [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
1295	              Internationalized Domain Names for Applications (IDNA) -
1296	              Unicode 6.0", RFC 6452, November 2011.

1298	   [RFC6698]  Hoffman, P. and J. Schlyter, "The DNS-Based Authentication
1299	              of Named Entities (DANE) Transport Layer Security (TLS)
1300	              Protocol: TLSA", RFC 6698, August 2012.

1302	   [Unicode32]
1303	              The Unicode Consortium, "The Unicode Standard, Version
1304	              3.2.0".

1306	              The Unicode Standard, Version 3.2.0 is defined by The
1307	              Unicode Standard, Version 3.0 (Reading, MA, Addison-
1308	              Wesley, 2000.  ISBN 0-201-61633-5), as amended by the
1309	              Unicode Standard Annex #27: Unicode 3.1
1310	              (http://www.unicode.org/reports/tr27/) and by the Unicode
1311	              Standard Annex #28: Unicode 3.2
1312	              (http://www.unicode.org/reports/tr28/).

1314	Appendix A.  Change Log

1316	   RFC Editor: Please remove this appendix before publication.

1318	A.1.  Changes from version -00 to -01

1320	   o  Version 01 of this document is an extensive rewrite and
1321	      reorganization, reflecting discussions with UTC members and adding
1322	      three more options for discussion to the original proposal to
1323	      simply disallow the new code point.

1325	A.2.  Changes from version -01 to -02

1327	   Corrected a typographical error in which Hamza Above was incorrectly
1328	   listed with the wrong code point.

1330	A.3.  Changes from version -02 to -03

1332	   Corrected a typographical error in the Abstract in which RFC 5892 was
1333	   incorrectly shown as 5982.

1335	A.4.  Changes from version -03 to -04

1337	   o  Explicitly identified the applicability of U+08A1 with Fula and
1338	      added references that discuss that language and how it is written.

1340	   o  Updated several Unicode 6.2 references to point to Unicode 7.0
1341	      since the latter is now available in stable form (it was done when
1342	      work on this I-D started).

1344	   o  Extensively revised to discuss the non-Arabic cases, non-
1345	      decomposing diacritics, other types of characters that don't
1346	      compare equal after normalization, and the more general problem
1347	      and approaches.

1349	Authors' Addresses

1351	   John C Klensin
1352	   1770 Massachusetts Ave, Ste 322
1353	   Cambridge, MA  02140
1354	   USA

1356	   Phone: +1 617 245 1457
1357	   Email: john-ietf@jck.com

1359	   Patrik Faltstrom
1360	   Netnod
1361	   Franzengatan 5
1362	   Stockholm  112 51
1363	   Sweden

1365	   Phone: +46 70 6059051
1366	   Email: paf@netnod.se