idnits 2.17.1 

draft-klensin-idna-5892upd-unicode70-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == It seems as if not all pages are separated by form feeds - found 34 form
     feeds but 744 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 1198: '...ated to True for the label, it MUST be...'

  -- The draft header indicates that this document updates RFC5892, but the
     abstract doesn't seem to directly say this.  It does mention RFC5892
     though, so this could be OK.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5892, updated by this document, for
     RFC5378 checks: 2008-04-26)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 8, 2017) is 2392 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also
     mentioned in 'RFC5892'.

  ** Downref: Normative reference to an Informational RFC: RFC 5894

  ** Downref: Normative reference to an Informational RFC: RFC 6943

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'UAX15-Versioning'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Arabic'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-CompatDecomp'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Design'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode70-Hamza'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Overlay'

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'Unicode70-Stability'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTS46'

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft
4	Updates: 5892, 5894 (if approved)                           P. Faltstrom
5	Intended status: Standards Track                                  Netnod
6	Expires: April 11, 2018                                  October 8, 2017

8	             IDNA Update for Unicode 7.0 and Later Versions
9	                draft-klensin-idna-5892upd-unicode70-05

11	Abstract

13	   The current version of the IDNA specifications anticipated that each
14	   new version of Unicode would be reviewed to verify that no changes
15	   had been introduced that required adjustments to the set of rules
16	   and, in particular, whether new exceptions or backward compatibility
17	   adjustments were needed.  The review for Unicode 7.0.0 first
18	   identified a potentially problematic new code point and then a much
19	   more general and difficult issue with Unicode normalization.  This
20	   specification discusses those issues and proposes updates to IDNA
21	   and, potentially, the way the IETF handles comparison of identifiers
22	   more generally, especially when there is no associated language or
23	   language identification.  It also applies an editorial clarification
24	   to RFC 5892 that was the subject of an earlier erratum and updates
25	   RFC 5894 to point to the issues involved.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at https://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on April 11, 2018.

44	Copyright Notice

46	   Copyright (c) 2017 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (https://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
62	     1.1.  Origins and Discovery of the Issue  . . . . . . . . . . .   4
63	     1.2.  IDNA2008 and Special or Exceptional Cases . . . . . . . .   5
64	     1.3.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   7
65	   2.  Document Aspirations  . . . . . . . . . . . . . . . . . . . .   8
66	   3.  Problem Description . . . . . . . . . . . . . . . . . . . . .   8
67	     3.1.  IDNA assumptions about Unicode normalization  . . . . . .   8
68	     3.2.  The discovery and the Arabic script cases . . . . . . . .  10
69	       3.2.1.  New code point U+08A1, decomposition, and language
70	               dependency  . . . . . . . . . . . . . . . . . . . . .  10
71	       3.2.2.  Other examples of the same behavior within the Arabic
72	               Script  . . . . . . . . . . . . . . . . . . . . . . .  11
73	       3.2.3.  Hamza and Combining Sequences . . . . . . . . . . . .  11
74	     3.3.  Precomposed characters without decompositions more
75	           generally . . . . . . . . . . . . . . . . . . . . . . . .  12
76	       3.3.1.  Description of the general problem  . . . . . . . . .  12
77	       3.3.2.  Latin Examples and Cases  . . . . . . . . . . . . . .  14
78	         3.3.2.1.  The font exclusion and compatibility
79	                   relationships . . . . . . . . . . . . . . . . . .  14
80	         3.3.2.2.  The phonetic notation characters and extensions .  14
81	         3.3.2.3.  The stroke (solidus) ambiguity  . . . . . . . . .  14
82	           3.3.2.3.1.  Combining dots and other shapes combine...
83	                       unless... . . . . . . . . . . . . . . . . . .  15
84	           3.3.2.3.2.  "Legacy" characters and new additions . . . .  16
85	       3.3.3.  Unexpected Combining Sequences  . . . . . . . . . . .  16
86	       3.3.4.  Examples and Cases from Other Scripts . . . . . . . .  17
87	         3.3.4.1.  Scripts with precomposed preferences and ones
88	                   with combining preferences  . . . . . . . . . . .  17
89	         3.3.4.2.  The Han and Kangxi Cases  . . . . . . . . . . . .  17
90	     3.4.  Confusion and the Casual User . . . . . . . . . . . . . .  17
91	   4.  Implementation options and issues: Unicode properties,
92	       exceptions, and the nature of stability . . . . . . . . . . .  18
93	     4.1.  Unicode Stability compared to IETF (and ICANN) Stability   18
94	     4.2.  New Unicode Properties  . . . . . . . . . . . . . . . . .  19
95	     4.3.  The need for exception lists  . . . . . . . . . . . . . .  20
96	   5.  Proposed/ Alternative Changes to RFC 5892 for the issues
97	       first exposed by new code point U+08A1  . . . . . . . . . . .  20
98	     5.1.  Disallow This New Code Point  . . . . . . . . . . . . . .  20
99	     5.2.  Disallow This New Code Point and All Future Precomposed
100	           Additions that Do Not Decompose . . . . . . . . . . . . .  22
101	     5.3.  Disallow the combining sequences for these characters . .  22
102	     5.4.  Use Combining Classes to Develop Additional Contextual
103	           Rules . . . . . . . . . . . . . . . . . . . . . . . . . .  23
104	     5.5.  Disallow all Combining Characters for Specific Scripts  .  23
105	     5.6.  Do Nothing Other Than Warn  . . . . . . . . . . . . . . .  24
106	     5.7.  Normalization Form IETF (NFI)   . . . . . . . . . . . . .  25
107	   6.  Editorial clarification to RFC 5892 . . . . . . . . . . . . .  26
108	   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  26
109	   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  26
110	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  27
111	   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  28
112	     10.1.  Normative References . . . . . . . . . . . . . . . . . .  28
113	     10.2.  Informative References . . . . . . . . . . . . . . . . .  30
114	   Appendix A.  Change Log . . . . . . . . . . . . . . . . . . . . .  33
115	     A.1.  Changes from version -00 (2014-07-21) to -01  . . . . . .  33
116	     A.2.  Changes from version -01 (2014-12-07) to -02  . . . . . .  33
117	     A.3.  Changes from version -02 (2014-12-07) to -03  . . . . . .  33
118	     A.4.  Changes from version -03 (2015-01-06) to -04  . . . . . .  33
119	     A.5.  Changes from version -04 (2015-03-11) to -05  . . . . . .  34
120	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  34

122	1.  Introduction

124	      Note in/about -04 and -05 Drafts: These two versions of the
125	      document contain a very large amount of new material as compared
126	      to the -03 version.  The new material reflects an evolution of
127	      community understanding in the first quarter of 2015 and further
128	      evolution between then and mid-2017 from an assumption that the
129	      problem involved only a few code points and one combining
130	      character in a single script (Hamza Above and Arabic) to an
131	      understanding that the problem we have come to call "non-
132	      decomposing code points" and several closely related ones are
133	      quite pervasive and may represent fundamental misunderstandings or
134	      omissions from IDNA2008 (and, by extension, the basics of PRECIS
135	      [RFC8264]) that must be corrected if those protocols are going to
136	      be used in a way that supports internationalized identifiers on
137	      the Internet predictably (as seen by the end user) and securely.

139	      This version is still necessarily incomplete: not only is our
140	      understanding probably still not comprehensive, but there are a
141	      number of placeholders for text and references.  Nonetheless, the
142	      document in its current form should be useful as both the
143	      beginning of a comprehensive overview of the issues and a source
144	      of references to other relevant materials.

146	      This draft could almost certainly be better organized to improve
147	      its readability: specific suggestions would be welcome.

149	1.1.  Origins and Discovery of the Issue

151	   The current version of the IDNA specifications, known as "IDNA2008"
152	   [RFC5890], anticipated that each new version of Unicode would be
153	   reviewed to verify that no changes had been introduced that required
154	   adjustments to IDNA's rules and, in particular, whether new
155	   exceptions or backward compatibility adjustments were needed.  When
156	   that review was carefully conducted for Unicode 7.0.0 [Unicode7],
157	   comparing it to prior versions including the text in Unicode 6.2
158	   [Unicode62], it identified a problematic new code point (U+08A1,
159	   ARABIC LETTER BEH WITH HAMZA ABOVE).  The code point was added for
160	   Arabic Script use with the Fula (also known as Fulfulde, Pulaar, and
161	   Pular'Fulaare) language.  That language is apparently most often
162	   written in Latin characters today [Omniglot-Fula] [Dalby] [Daniels].

164	   The specific problem is discussed in detail in Section 3.  In very
165	   broad terms, IDNA (and other IETF work) assumes that, if one can
166	   represent "the same character" either as a combining sequence or as a
167	   single code point, strings that are identical except for those
168	   alternate forms will compare equal after normalization.  Part of the
169	   difficulty that has characterized this discussion is that "the same"
170	   differs depending on the criteria that are chosen.  It may be further
171	   complicated in practice by differences in preferred type styles or
172	   rendering, but Unicode code point choices are not supposed to depend
173	   on type style (font) variations and, again, IDNA has no mechanism for
174	   specifying language choices that might affect rendering.

176	   The behavior of the newly-added code point, while non-optimal for
177	   IDNA, follows that of a few code points that predate Unicode 7.x and
178	   even the IDNA2008 specifications and Unicode 6.0.  Those existing
179	   code points, which may not be easy to accurately characterize as a
180	   group, complicate the question of what, if anything, to do about this
181	   new and exceedingly problematic code point and, perhaps separately,
182	   what to do about existing sets of code points with the same
183	   behavior, because different reasonable criteria yield different
184	   decisions, specifically:

186	   o  To disallow it (and future, but not existing, characters with
187	      similar characteristics) as an IDNA exception case creates
188	      inconsistencies with how those earlier code points were handled.

190	   o  To disallow it and the similar code points as well would
191	      necessitate invalidating some potential labels that would have
192	      been valid under IDNA2008 until this time.  Depending on how the
193	      collection of similar code points is characterized, a few of them
194	      are almost certainly used in reasonable labels.

196	   o  To permit the new code point to be treated as PVALID creates a
197	      situation in which it is possible, within the same script, to
198	      compose the same character symbol (glyph or grapheme) in two
199	      different ways that do not compare equal even after normalization.
200	      That condition would then apply to it and the earlier code points
201	      with the same behavior.  That situation contradicts a fundamental
202	      assumption of IDNA that is discussed in more detail below.

204	   NOTE IN DRAFT:

206	      This working draft discusses six alternatives, including an idea
207	      (an IETF-specific normalization form) that seemed too drastic to
208	      be considered when IDNA2008 was designed or even when the review
209	      of Unicode 7.0 for IDNA purposes began.  In retrospect, it not
210	      only would have been appropriate to discuss when the IDNA2008
211	      specifications were being developed but is appearing more
212	      attractive now.  The authors suggest that the community discuss
213	      the relevant tradeoffs and make a decision and that the document
214	      then be revised to reflect that decision, with the other
215	      alternatives discussed as options not chosen.  Because there is no
216	      ideal choice, the discussion of the issues in Section 3 is
217	      probably as or more important than the particular choice of how to
218	      handle this code point.  In addition to providing information for
219	      this document, that section should be considered as an updating
220	      addendum to RFC 5894 [RFC5894] and should be incorporated into any
221	      future revision of that document.

223	      As the result of this version of the document containing several
224	      alternate proposals, some of the text is also a little bit
225	      redundant.  That will be corrected in future versions.

227	1.2.  IDNA2008 and Special or Exceptional Cases

229	   IDNA2008 contains several types of explicit provisions for characters
230	   (code points) that require special treatment when the requirements of
231	   the DNS cannot easily be met by calculations based on stable Unicode
232	   properties.  Those provisions are
233	   [[CREF1: ... to be supplied]]

235	   As anticipated when IDNA2008, and RFC 5892 in particular, were
236	   written, exceptions and explicit updates are likely to be needed only
237	   if there is disagreement between the Unicode Consortium's view about
238	   what is best for the Standard and its very diverse user community and
239	   the IETF's view of what is best for IDNs, the DNS, and IDNA.  It was
240	   hoped that a situation would never arise in which the two
241	   perspectives would disagree, but the possibility was anticipated and
242	   considerable mechanism added to RFC 5890 and 5892 as a result.  It is
243	   probably important to note that a disagreement in this context does
244	   not imply that anyone is "wrong", only that the two different groups
245	   have different needs and therefore criteria about what is acceptable.
246	   In particular, it appears that the Unicode Consortium has made
247	   assumptions about the availability (by explicit designation or
248	   context) of information about applicable languages or other context
249	   for a given string that are not possible for IDNA.  For that reason,
250	   the IETF has, in the past, allowed some characters for IDNA that
251	   active Unicode Technical Committee members suggested be disallowed to
252	   avoid a change in derived tables [RFC6452].  This document describes
253	   a set of cases for which the IETF must consider disallowing sets of
254	   characters that the various properties would otherwise treat as
255	   PVALID.

257	   This document provides the "flagging for the IESG" specified by
258	   Section 5.1 of RFC 5892.  As specified there, the change itself
259	   requires IETF review because it alters the rules of Section 2 of that
260	   document.

262	      [[RFC Editor: please remove the following comment and note if they
263	      get to you.]]

265	      [[IESG: It might not be a bad idea to incorporate some version of
266	      the following into the Last Call announcement.]]

268	      NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
269	      particularly the choices among options for either adding exception
270	      cases to RFC 5892 or ignoring the issue, warning people, and
271	      hoping the results do not include or enable serious problems, are
272	      fairly esoteric.  Understanding them requires that one have at
273	      least some understanding of how scripts in which precomposed
274	      characters are preferred over combining sequences as a Unicode
275	      design and extension principle work.  Those scripts include Arabic
276	      but, unlike the assumption when the issues were first discovered,
277	      are by no means limited to it.  Readers should also understand the
278	      reasons the Unicode Standard gives various Arabic Script
279	      characters a fairly extended discussion [Unicode70-Arabic] but
280	      should treat that only as an example and note that most other
281	      cases are much less well documented.  It also requires
282	      understanding of a number of Unicode principles, including the
283	      Normalization Stability rules [UAX15-Versioning] as applied to new
284	      precomposed characters and guidelines for adding new characters.
285	      There is considerable discussion of the issues in Section 3 and
286	      references are provided for those who want to pursue them, but
287	      potential reviewers should assume that the background needed to
288	      understand the reasons for this change is no less deep in the
289	      subject matter than would be expected of someone reviewing a
290	      proposed change in, e.g., the fundamentals of BGP, TCP congestion
291	      control, or some cryptographic algorithm.  Put more bluntly, one's
292	      ability to read or speak languages other than English, or even one
293	      or more languages that use the Arabic script or other scripts
294	      similarly affected, does not make one an expert in these matters.

296	1.3.  Terminology

298	   This document assumes that the reader is reasonably familiar with the
299	   terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
300	   IETF conventions for representing Unicode code points [RFC5137].
301	   Some terms used here may not be used in the same way in those two
302	   sets of documents.  From one point of view, those differences may
303	   have been the results of, or led to, misunderstandings that may, in
304	   turn, be part of the root cause of the problems explored in this
305	   document.  In particular, this document uses the term "precomposed
306	   character" to describe characters that could reasonably be composed
307	   by a combining sequence using code points with appropriate appearance
308	   in common type styles but for which a single code point that does not
309	   require combining sequences is available.  That definition is
310	   strictly about mechanical composition and does not involve any
311	   considerations about how the character is used.  It is closely
312	   related to this document's definition of "identical".  When a
313	   precomposed character exists and either applying NFC to the combining
314	   sequence does not yield that character or applying NFD to that
315	   character's code point does not yield the combining sequence, it is
316	   referred to in this document as "non-decomposable".
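
   As a rough illustration of that definition, the following Python
   sketch applies both tests to a pair supplied by the caller.  The
   function name is_non_decomposable is hypothetical.  The pairing of a
   precomposed code point with a "matching" combining sequence is an
   input based on human judgment that the code cannot discover, and
   the results depend on the Unicode version bundled with the Python
   interpreter.

```python
import unicodedata


def is_non_decomposable(precomposed: str, sequence: str) -> bool:
    """Apply this document's test to one caller-supplied pair.

    Returns True when NFC of the combining sequence does not yield the
    precomposed code point, or NFD of the precomposed code point does
    not yield the combining sequence.
    """
    nfc_fails = unicodedata.normalize("NFC", sequence) != precomposed
    nfd_fails = unicodedata.normalize("NFD", precomposed) != sequence
    return nfc_fails or nfd_fails


# U+00F6 versus U+006F U+0308: normalization relates the two forms.
print(is_non_decomposable("\u00F6", "o\u0308"))

# U+08A1 versus U+0628 U+0654: normalization does not relate them.
print(is_non_decomposable("\u08A1", "\u0628\u0654"))
```

   A check of this kind only classifies pairs that someone has already
   judged to look alike.  It is not a way to enumerate such pairs.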

318	   The document also uses some terms that are familiar to those who have
319	   been involved with IDNs and IDNA for a long time, but uses them more
320	   precisely than may be common in other quarters.  For example, the
321	   term "Punycode" is not used at all in the rest of this document
322	   because it is the name of a very specific encoding algorithm
323	   [RFC3492] that does not incorporate the rules and algorithms for
324	   domain name labels that are produced by that encoding.  Instead, the
325	   generic terms "ACE" or "ACE string" for "ASCII-compatible encoding"
326	   are used to refer to strings that abstractly contain characters
327	   outside the ASCII repertoire [RFC0020] but are encoded so that only
328	   ASCII characters appear in the string that would be encountered by a
329	   user or protocol and the terms "A-label" and "U-label", as defined in
330	   RFC 5890, to refer to the ACE and more conventional (or "native")
331	   character forms in which those non-ASCII characters appear in
332	   conventional Unicode encodings (typically UTF-8).
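
   The distinction can be sketched with Python's built-in "punycode"
   codec, which implements only the encoding algorithm of RFC 3492.
   The label "bücher" is a hypothetical example, and the sketch
   performs none of the IDNA2008 validity checks, so its output is an
   ACE string and should not be assumed to be a valid A-label.

```python
# Hypothetical example label; not taken from this document.
u_form = "b\u00FCcher"

# The Punycode algorithm alone: no prefix and no IDNA validity rules.
encoded = u_form.encode("punycode").decode("ascii")

# Prepending the ACE prefix "xn--" yields an ACE string.
ace = "xn--" + encoded

print(encoded)
print(ace)

# Decoding reverses only the encoding step.
round_trip = ace[4:].encode("ascii").decode("punycode")
print(round_trip == u_form)
```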

334	2.  Document Aspirations

336	   This document, in its present form, is not a proposal for a solution.
337	   Instead, it is intended to be (or evolve into) a comprehensive
338	   description of the issues and problems and to outline some possible
339	   approaches to a solution.  A perfect solution -- one that would
340	   resolve all of the issues identified in this document -- would
341	   involve a relatively small set of relatively simple rules and hence
342	   would be comprehensible and predictable for and by non-expert end
343	   users, would not require code point by code point or even block by
344	   block exception lists, and would not leave users of any script or
345	   language feeling that their particular writing system has been
346	   treated less fairly than others.

348	   Part of the reality we need to accept is that IDNA, in its present
349	   form, represents compromises that do not completely satisfy those
350	   criteria and whatever is done about these issues will probably make
351	   it (or the job of administering zones containing IDNs) more complex.
352	   Similarly, as the Unicode Standard suggests when it identifies ten
353	   Design Principles and the text then says "Not all of these principles
354	   can be satisfied simultaneously..." [Unicode70-Design], while there
355	   are guidelines and principles, a certain amount of subjective
356	   judgment is involved in making determinations about normalization,
357	   decomposition, and some property values.  For Unicode itself, those
358	   issues are resolved by multiple statements (at least one cited below)
359	   that one needs to rely on per-code point information in the Unicode
360	   Character Database rather than on rules or principles.  The design of
361	   IDNA and the effort to keep it largely independent of Unicode
362	   versions requires rules, categories, and principles that can be
363	   relied upon and applied algorithmically.  There is obviously some
364	   tension between the two approaches.

366	3.  Problem Description

368	3.1.  IDNA assumptions about Unicode normalization

370	   IDNA makes several assumptions about Unicode, Unicode "characters",
371	   and the effects of normalization.  Those assumptions were based on
372	   careful reading of the Unicode Standard at the time [Unicode5],
373	   guided by advice and commitments by members of the Unicode Technical
374	   Committee.  Those assumptions, and the associated requirements, are
375	   necessitated by three properties of DNS labels that typically do not
376	   apply to blocks of running text:

378	   1.  There is no language context for a label.  While particular DNS
379	       zones may impose restrictions, including language or script
380	       restrictions, on what labels can be registered, neither the DNS
381	       nor IDNA impose either type of restriction or give the user of a
382	       label any indication about the registration or other restrictions
383	       that may have been imposed.

385	   2.  Labels are often mnemonics rather than words in any language.
386	       They may be abbreviations or acronyms or contain embedded digits
387	       and have other characteristics that are not typical of words.

389	   3.  Labels are, in practice, usually short.  Even when they are the
390	       maximum length allowed by the DNS and IDNA, they are typically
391	       too short to provide significant context.  Statements that
392	       suggest that languages can almost always be determined from
393	       relatively short paragraphs or equivalent bodies of text do not
394	       apply to DNS labels because of their typical short length and
395	       because, as noted above, they are not required to be formed
396	       according to language-based rules.

398	   At the same time, because the DNS is an exact-match system, there
399	   must be no ambiguity about whether two labels are equal.  Although
400	   there have been extensive discussions about "confusingly similar"
401	   characters, labels, and strings, such tests between scripts are
402	   always somewhat subjective: they are affected by choices of type
403	   styles and by what the user expects to see.  In spite of the fact
404	   that the glyphs that represent many characters in different scripts
405	   are identical in appearance (e.g., basic Latin "a" (U+0061) and the
406	   identical-appearing Cyrillic character (U+0430)), the most important
407	   test is that, if two glyphs are the same within a given script, they
408	   must represent the same character no matter how they are formed.

410	   Unicode normalization, as explained in [UAX15], is expected to
411	   resolve those "same script, same glyph, different formation methods"
412	   issues.  Within the Latin script, the code point sequence for lower
413	   case "o" (U+006F) and combining diaeresis (U+0308) will, when
414	   normalized using the "NFC" method required by IDNA, produce the
415	   precomposed small letter o with diaeresis (U+00F6) and hence the two
416	   ways of forming the character will compare equal (and the combining
417	   sequence is effectively prohibited from U-labels).
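
   The Latin example above, together with the cross-script case from
   the preceding paragraph, can be checked with a minimal Python
   sketch.  It assumes the standard unicodedata module, whose results
   reflect the Unicode version bundled with the interpreter.  The
   variable names are illustrative only.

```python
import unicodedata

combining = "o\u0308"      # U+006F followed by U+0308
precomposed = "\u00F6"     # U+00F6

# Before normalization the two code point sequences differ.
raw_equal = combining == precomposed

# After NFC the two ways of forming the character compare equal.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed

# Normalization is not expected to unify look-alikes across scripts.
latin_a, cyrillic_a = "\u0061", "\u0430"
cross_script_equal = (
    unicodedata.normalize("NFC", latin_a)
    == unicodedata.normalize("NFC", cyrillic_a)
)

print(raw_equal, nfc_equal, cross_script_equal)
```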

419	   NFC was preferred over other normalization methods for IDNA because
420	   it is more compact, more likely to be produced on keyboards on which
421	   the relevant characters actually appeared, and because it does not
422	   lose substantive information (e.g., some types of compatibility
423	   equivalence involve judgment calls as to whether two characters are
424	   actually the same -- they may be "the same" in some contexts but not
425	   others -- while canonical equivalence is about different ways to
426	   produce the glyph for the same abstract character).
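
   The two properties mentioned above, compactness and preservation of
   information, can be illustrated with a short Python sketch.  The
   sample characters (U+00F6 and the ligature U+FB01) are chosen here
   for illustration and are not taken from the IDNA specifications.

```python
import unicodedata

# Compactness: NFC keeps the precomposed form, NFD expands it.
o_diaeresis = "\u00F6"
nfc_len = len(unicodedata.normalize("NFC", o_diaeresis))
nfd_len = len(unicodedata.normalize("NFD", o_diaeresis))
print(nfc_len, nfd_len)

# Information: canonical normalization (NFC) leaves a compatibility
# character alone, while compatibility normalization (NFKC) replaces
# it with the characters it is judged to be equivalent to.
ligature_fi = "\uFB01"     # LATIN SMALL LIGATURE FI
nfc_lig = unicodedata.normalize("NFC", ligature_fi)
nfkc_lig = unicodedata.normalize("NFKC", ligature_fi)
print(nfc_lig == ligature_fi, nfkc_lig)
```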

428	   IDNA also assumed that the extensive Unicode stability rules would be
429	   applied and work as specified when new code points were added.  Those
430	   rules, as described in The Unicode Standard and the normative annexes
431	   identified below, provide that:

433	   1.  New code points representing precomposed characters that can be
434	       formed from combining sequences will not be added to Unicode
435	       unless neither the relevant base character nor required combining
436	       character(s) are part of the Standard within the relevant script
437	       [UAX15-Versioning].

439	   2.  If circumstances require that principle be violated,
440	       normalization stability requires that the newly-added character
441	       decompose (even under NFC) to the previously-available combining
442	       sequence [UAX15-Exclusion].
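
   The mechanism behind the second rule, a precomposed code point that
   decomposes even under NFC, can be observed with a code point that
   is excluded from composition.  The sketch below uses U+0958
   (DEVANAGARI LETTER QA) purely as an assumption-laden illustration:
   it is a long-standing exclusion selected for this example, not a
   newly-added character and not a case discussed in this document.

```python
import unicodedata

qa = "\u0958"               # DEVANAGARI LETTER QA (precomposed)
qa_sequence = "\u0915\u093C"  # KA followed by the NUKTA combining sign

# The precomposed code point carries a canonical decomposition ...
mapping = unicodedata.decomposition(qa)

# ... and, being excluded from composition, it decomposes under NFC.
nfc_of_precomposed = unicodedata.normalize("NFC", qa)
nfc_of_sequence = unicodedata.normalize("NFC", qa_sequence)

# Both spellings therefore compare equal after NFC.
print(mapping)
print(nfc_of_precomposed == nfc_of_sequence)
```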

444	   At least at the time IDNA2008 was being developed, there was no
445	   explicit provision in the Standard's discussion of conditions for
446	   adding new code points, nor of normalization stability, for an
447	   exception based on different languages using the same script or
448	   ambiguities about the shape or positioning of combining characters.

450	3.2.  The discovery and the Arabic script cases

452	   While the set of problems with normalization discussed above was
453	   discovered with a newly-added code point for the Arabic Script and
454	   some characteristics of Unicode handling of that script seem to make
455	   the problem more complex going forward, these are not issues specific
456	   to Arabic.  This section describes the Arabic-specific problems;
457	   subsequent ones (starting with Section 3.3) discuss the problem more
458	   generally and include illustrations from other scripts.

460	3.2.1.  New code point U+08A1, decomposition, and language dependency

462	   Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
463	   WITH HAMZA ABOVE.  As can be deduced from the name, it is visually
464	   identical to the glyph that can be formed from a combining sequence
465	   consisting of the code point for ARABIC LETTER BEH (U+0628) and the
466	   code point for Combining Hamza Above (U+0654).  The two rules
467	   summarized above (see the last part of Section 3.1) suggest that
468	   either the new code point should not be allocated at all or that it
469	   should have a decomposition to \u'0628'\u'0654'.
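
	   A minimal sketch of the mismatch, using Python's standard
	   unicodedata module (the constant and function names are
	   hypothetical, and the runtime is assumed to carry Unicode 7.0.0
	   or later data), might look like this:

```python
import unicodedata

BEH_WITH_HAMZA_ABOVE = "\u08A1"        # single code point, new in Unicode 7.0.0
BEH_PLUS_HAMZA_ABOVE = "\u0628\u0654"  # BEH followed by combining HAMZA ABOVE


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


# An empty string means the code point has no decomposition mapping.
print("decomposition:", repr(unicodedata.decomposition(BEH_WITH_HAMZA_ABOVE)))

# The two forms stay distinct after normalization, so labels that differ
# only in this respect do not compare equal.
print("equal under NFC:", nfc(BEH_WITH_HAMZA_ABOVE) == nfc(BEH_PLUS_HAMZA_ABOVE))
```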

471	   Had the issues outlined in this document been better understood at
472	   the time, it probably would have been wise for RFC 5892 to disallow
473	   either the precomposed character or the combining sequence of each
474	   pair in those cases in which Unicode normalization rules do not cause
475	   the right thing to happen, i.e., the combining sequence and
476	   precomposed character to be treated as equivalent.  Failure to do so
477	   at the time places an extra burden on registries to be sure that
478	   conflicts (and the potential for confusion and attacks) do not exist.
479	   Oddly, had the exclusion been made part of the specification at that
480	   time, the preference for precomposed forms noted above would probably
481	   have dictated excluding the combining sequence, something not
482	   otherwise done in IDNA2008 because the NFC requirement serves the
483	   same purpose.  Today, the only thing that can be excluded without the
484	   potential disruption of disallowing a previously-PVALID combining
485	   sequence is the newly-added code point, so whatever is
486	   done, or might have been contemplated with hindsight, will be
487	   somewhat inconsistent.

489	3.2.2.  Other examples of the same behavior within the Arabic Script

491	   One of the things that complicates the issue with the new U+08A1 code
492	   point is that there are several other Arabic-script code points that
493	   behave in the same way for similar language-specific reasons.

495	   In particular, at least three other grapheme clusters that have been
496	   present for many versions of Unicode can be seen as involving issues
497	   similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
498	   ABOVE.  ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
499	   REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
500	   preferred over combining sequences using HAMZA ABOVE (U+0654)
501	   [Unicode70-Hamza].  By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
502	   (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH
503	   HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC
504	   LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654'
505	   so the precomposed character and combining sequences compare equal
506	   when both are normalized, as this specification prefers.

508	   There are other variations in which a precomposed character involving
509	   HAMZA ABOVE has a decomposition to a combining sequence that can form
510	   it.  For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
511	   compatibility decomposition, but not a canonical one, into the
512	   combining sequence \u'06C7'\u'0674'.
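
	   The three behaviors described in this subsection (canonical
	   decomposition, compatibility decomposition, and no decomposition)
	   can be told apart from the Unicode Character Database.  The
	   following is a sketch only; classify() is a hypothetical helper
	   and relies on the convention that Python's unicodedata module
	   reports compatibility mappings with a leading "<tag>".

```python
import unicodedata

# Code points named in this subsection and in Section 3.2.1.
HAMZA_ABOVE_LETTERS = [0x0623, 0x0624, 0x0626, 0x0677, 0x0681, 0x076C, 0x08A1]


def classify(cp):
    """Return 'canonical', 'compatibility', or 'none' for a code point."""
    mapping = unicodedata.decomposition(chr(cp))
    if not mapping:
        return "none"
    # Compatibility mappings carry a formatting tag such as "<compat>".
    return "compatibility" if mapping.startswith("<") else "canonical"


for cp in HAMZA_ABOVE_LETTERS:
    name = unicodedata.name(chr(cp), "(name not in this Unicode version)")
    print("U+%04X %-14s %s" % (cp, classify(cp), name))
```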

514	3.2.3.  Hamza and Combining Sequences

516	   As the Unicode Standard points out at some length [Unicode70-Arabic],
517	   Hamza is a problematic abstract character and the "Hamza Above"
518	   construction even more so [Unicode70-Hamza].  Those sections explain
519	   a distinction made by Unicode between the use of a Hamza mark to
520	   denote a glottal stop and one used as a diacritic mark to denote a
521	   separate letter.  In the first case, the combining sequence is used.
522	   In the second, a precomposed character is assigned.

524	   Unlike Unicode generally and because of concerns about identifier
525	   spoofing and attacks based on similarities, character distinctions in
526	   IDNA are based much more strictly on the appearance of characters;
527	   language and pronunciation distinctions within a script are not
528	   considered.  So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-
529	   tautologically the same as BEH WITH HAMZA ABOVE, even if one of them
530	   is written as U+08A1 (new to Unicode 7.0.0) and the other as the
531	   sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also
532	   available in versions of Unicode going back at least to the version
533	   [Unicode32] used in the original version of IDNA [RFC3490]).  Because
534	   the precomposed form and combining sequence are, for IDNA purposes,
535	   the same, IDNA expects that normalization (specifically the
536	   requirement that all U-labels be in NFC form) will cause them to
537	   compare equal.

539	   If Unicode also considered them the same, then the principle would
540	   apply that new precomposed ("composition") forms are not added unless
541	   one of the code points that could be used to construct it did not
542	   exist in an earlier version (and even then is discouraged)
543	   [UAX15-Versioning].  When exceptions are made, they are expected to
544	   conform to the rules and classes in the "Composition Exclusion
545	   Table", with class 2 being relevant to this case [UAX15-Exclusion].
546	   That rule essentially requires that the normalization for the old
547	   combining sequence to itself be retained (for stability) but that the
548	   newly-added character be treated as canonically decomposable and
549	   decompose back to the older sequence even under NFC.  That was not
550	   done for this particular case, presumably because of the distinction
551	   about pronunciation modifiers versus separate letters noted above.
552	   Because, for IDNA and the DNS, there is a possibility that the
553	   composing sequence \u'0628'\u'0654' already appears in labels, the
554	   only choice other than allowing an otherwise-identical, and
555	   identically-appearing, label with U+08A1 substituted to identify a
556	   different DNS entry is to DISALLOW the new character.

558	3.3.  Precomposed characters without decompositions more generally

560	3.3.1.  Description of the general problem

562	   As mentioned above, IDNA made a strong assumption that, if there were
563	   two ways to form the same abstract character in the same script,
564	   normalization would result in them comparing equal.  Work on IDNA2008
565	   recognized that early versions of Unicode might also contain some
566	   inconsistencies; see Section 3.3.2.3.2 below.

568	   Having precomposed code points exist that don't have decompositions,
569	   or having code points of that nature allocated in the future, is
570	   problematic for those IDNA assumptions about character comparison.
571	   It seems to call for either excluding some set of code points that
572	   IDNA's rules do not now identify, development and use of a
573	   normalization procedure that behaves as expected (those two options
574	   may be nearly equivalent for many purposes), or deciding to accept a
575	   risk that, apparently, will only increase over time.

577	   It is not clear whether the reasons the IDNABIS WG did not understand
578	   and allow for these cases are important except insofar as they inform
579	   considerations about what to do in the future.  It seemed (and still
580	   seems to some people) that the Unicode Standard is very clear on the
581	   matter (or at least was when IDNA2008 was being developed).  In
582	   addition to the normalization stability rules cited in the last part
583	   of Section 3.1, the discussion in the Core Standard seems quite
584	   clear.  For example, "Where characters are used in different ways in
585	   different languages, the relevant properties are normally defined
586	   outside the Unicode Standard" in Section 2.2, subsection titled
587	   "Semantics" [Unicode7] did not suggest to most readers that sometimes
588	   separate code points would be allocated within a script based on
589	   language considerations.  Similarly, the same section of the Standard
590	   says, in a subsection titled "Unification", "The Unicode Standard
591	   avoids duplicate encoding of characters by unifying them within
592	   scripts across language" and does not list exceptions to that rule or
593	   limit it to a single script although it goes on to list "CJK" as an
594	   example.  Another subsection, "Equivalent Sequences" indicates
595	   "Common precomposed forms ... are included for compatibility with
596	   current standards.  For static precomposed forms, the standard
597	   provides a mapping to an equivalent dynamically composed sequence of
598	   characters".  The latter appears to be precisely the "all precomposed
599	   characters decompose into the relevant combining sequences if the
600	   relevant base and combining characters exist in the Standard" rule
601	   that IDNA needs and assumed and, again, there is no mention of
602	   exceptions, language-dependent or otherwise.  The summary of
603	   stability policies cited in the Standard [Unicode70-Stability] does
604	   not appear to shed any additional light on these issues.

606	   The Standard now contains a subsection titled "Non-decomposition of
607	   Overlaid Diacritics" [Unicode70-Overlay] that identifies a list of
608	   diacritics that do not normally form characters that have
609	   decompositions.  The rule given has its own exceptions and the text
610	   clearly states that there is actually no way to know whether a code
611	   point has a decomposition other than consulting the Unicode Character
612	   Database entry for that code point.  The subsequent section notes
613	   that this can be a security problem.  While the issues with IDNA go
614	   well beyond what is normally considered security, that comment now
615	   seems clear.  While that subsection is helpful in explaining the
616	   problem, especially for European scripts, it does not appear in the
617	   Unicode versions that were current when IDNA2008 was being developed.

619	3.3.2.  Latin Examples and Cases

621	   While this set of problems was discovered because of a code point
622	   added to the Arabic script in precombined form to support a
623	   particular language, there are actually far more examples for, e.g.,
624	   Latin script than there are for Arabic script.  Many of them are
625	   associated with the "non-decomposition of combining diacriticals"
626	   issues mentioned above, but the next subsections describe other cases
627	   that are not directly bound to decomposition.

629	3.3.2.1.  The font exclusion and compatibility relationships

631	   Unicode contains a large collection of characters that are identified
632	   as "Mathematical Symbols".  A large subset of them are basic or
633	   decorated Latin characters, differing from the ordinary ones only by
634	   their usage and, in appearance, by font or type styling (despite the
635	   general principle that font distinctions are not used as the basis
636	   for assigning separate code points).  Most of these have canonical
637	   mappings to the base form, which eliminates them from IDNA, but
638	   others do not and, because the same marks that are used as phonetic
639	   diacritical markings in conventional alphabetical use have special
640	   mathematical meanings, applications that permit the use of these
641	   characters have their own issues with normalization and equality.

643	3.3.2.2.  The phonetic notation characters and extensions

645	   Another example involves various Phonetic Alphabet and Extension
646	   characters, many of which, unlike the Mathematical ones, do not have
647	   normalizations that would make them compare equal to the basic
648	   characters with essentially identical representations.  This would
649	   not be a problem for IDNA if they were identified with a specialized
650	   script or as symbols rather than letters, but neither is the case:
651	   they are generally identified as lower case Latin Script letters even
652	   when they are visually upper-case, another issue for IDNA.
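
	   As a hedged illustration, the sketch below inspects one such
	   character.  The choice of U+0262 as the example and the name
	   describe() are assumptions of this sketch; Python's unicodedata
	   module does not expose the Script property, so only the name,
	   General Category, and decomposition are shown.

```python
import unicodedata

# Example chosen for this sketch: a phonetic letter that looks like a
# small upper-case "G".
SMALL_CAPITAL_G = "\u0262"


def describe(ch):
    """Return (name, General Category, decomposition mapping) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


name, category, mapping = describe(SMALL_CAPITAL_G)
print(name)                # character name from the UCD
print(category)            # "Ll" denotes a lower-case letter
print(repr(mapping))       # empty string: no decomposition at all

# Neither NFC nor NFKC relates it to the basic letter "g" or "G".
print(unicodedata.normalize("NFKC", SMALL_CAPITAL_G) == SMALL_CAPITAL_G)
```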

654	3.3.2.3.  The stroke (solidus) ambiguity

656	   Some combining characters have two or more forms.  For example, in
657	   the case of the character popularly known as "slash", "stroke", or
658	   "solidus" (sometimes prefixed by "forward"), there are "short" and
659	   "long" combining forms, U+0337 (COMBINING SHORT SOLIDUS OVERLAY) and
660	   U+0338 (COMBINING LONG SOLIDUS OVERLAY).  It is not clear how long a
661	   short one needs to be to make it "long" or how short a long one needs
662	   to be to make it "short".  Perhaps for that reason, U+00F8 has no
663	   decomposition and neither U+006F U+0337 nor U+006F U+0338 combine to
664	   it with NFC.

666	   Adding to the confusion, at least when one attempts to use Unicode
667	   character names to identify places to look for problems, U+00F8 is
668	   formally called LATIN SMALL LETTER O WITH STROKE but, in combining
669	   character terminology, the term "stroke" refers to a horizontal bar,
670	   not an angled one, as in U+0335 and U+0336 (also short and long
671	   versions).  However, when one overlays one of those on an "o"
672	   (U+006F), one gets U+0275, LATIN SMALL LETTER BARRED O, not "...o
673	   with stroke".  That character, by the way, does not decompose either.
674	   This does illustrate the principle that it is not feasible to rely on
675	   Unicode code point names to identify confusable character sequences,
676	   even ones that produce the same, more or less font-independent,
677	   grapheme clusters.
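
	   The behavior described in this subsection can be checked with a
	   short sketch.  It uses Python's standard unicodedata module; the
	   names OVERLAYS and composes_with_o() are hypothetical and exist
	   only for this illustration.

```python
import unicodedata

O_WITH_STROKE = "\u00F8"   # LATIN SMALL LETTER O WITH STROKE
BARRED_O = "\u0275"        # LATIN SMALL LETTER BARRED O

# The four overlay marks mentioned in the text.
OVERLAYS = {
    "U+0335": "\u0335",
    "U+0336": "\u0336",
    "U+0337": "\u0337",
    "U+0338": "\u0338",
}


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def composes_with_o(mark):
    """True if NFC turns "o" plus the mark into a single code point."""
    return len(nfc("o" + mark)) == 1


for label, mark in OVERLAYS.items():
    print("o +", label, "composes:", composes_with_o(mark))

# Neither precomposed letter has a decomposition mapping.
print(repr(unicodedata.decomposition(O_WITH_STROKE)))
print(repr(unicodedata.decomposition(BARRED_O)))
```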

679	3.3.2.3.1.  Combining dots and other shapes combine... unless...

681	   The discussion of "Non-decomposition of Overlaid Diacritics"
682	   [Unicode70-Overlay] indirectly exhibits at least one reason why it
683	   has been difficult to characterize the problem.  If one combines that
684	   subsection with others, one gets a set of rules that might be
685	   described as:

687	   1.  If the precomposed character and the code points that make up the
688	       combining sequence exist, then canonical composition and
689	       decomposition work as expected, except...

691	   2.  If the precomposed character was added to Unicode after the code
692	       points that make up the combining sequence, normalization
693	       stability for the combining sequences requires that NFC applied
694	       to the precomposed character decomposes rather than having the
695	       combining sequence compose to the new character, however...

697	   3.  If the combining sequence involves a diacritic or other mark that
698	       actually touches the base character when composed, the
699	       precomposed character does not have a decomposition, unless...

701	   4.  The combining diacritic involved is Cedilla (U+0327), Ogonek
702	       (U+0328), or Horn (U+031B), in which case the precomposed
703	       characters that contain them "regularly" (but presumably not
704	       always) decompose, and...

706	   5.  There are further exceptions for Hamza which does not overlay the
707	       associated base character in the same way the Latin-derived
708	       combining diacritics and other marks do.  Those decisions to
709	       decompose a precomposed character (or not) are based on language
710	       or phonetic considerations, not the combining mechanism or
711	       appearance, or perhaps,...

713	   6.  Some characters have compatibility decompositions rather than
714	       canonical ones [Unicode70-CompatDecomp].  Because compatibility
715	       relationships are treated differently by IDNA, PRECIS [RFC8264],
716	       and, potentially, other protocols involving identifiers for
717	       Internet use, the existence of a compatibility relationship may or
718	       may not be helpful.  Finally,...

720	   7.  There is no reason to believe the above list is complete.  In
721	       particular, if whether a precomposed character decomposes or not
722	       is determined by language or phonetic distinctions or by a
723	       decision that all new characters for some scripts will be
724	       precomposed while new ones for others will be added (if needed)
725	       as combining sequences, one may need additional rules on a per-
726	       script and/or per-character basis.

728	   The above list only covers the cases involving combining sequences.
729	   It does not cover cases such as those in Section 3.3.2.1 and
730	   Section 3.3.2.2 and there may be additional groups of cases not yet
731	   identified.
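
	   Rules 3 and 4 can be contrasted with a few Latin examples.  This
	   is a sketch under stated assumptions: the sample characters were
	   picked for this illustration, the SAMPLES table and nfc() helper
	   are hypothetical names, and a handful of code points proves
	   nothing about the rules in general.

```python
import unicodedata

# (precomposed character, candidate combining sequence, description)
SAMPLES = [
    ("\u00E7", "c\u0327", "c with cedilla"),
    ("\u0105", "a\u0328", "a with ogonek"),
    ("\u01A1", "o\u031B", "o with horn"),
    ("\u0142", "l\u0335", "l with stroke (overlay)"),
    ("\u00F8", "o\u0338", "o with stroke (overlay)"),
]


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


for precomposed, sequence, description in SAMPLES:
    mapping = unicodedata.decomposition(precomposed) or "(none)"
    same = nfc(sequence) == nfc(precomposed)
    print("%-26s decomposition=%-10s equal under NFC=%s"
          % (description, mapping, same))
```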

733	3.3.2.3.2.  "Legacy" characters and new additions

735	   The development of categories and rules for IDNA recognized that
736	   early versions of Unicode might contain some inconsistencies if
737	   evaluated using more contemporary rules about code point assignments
738	   and stability.  In particular, there might be some exceptions from
739	   different practices in early versions of Unicode or anomalies caused
740	   by copying existing single- or dual-script standards into Unicode as
741	   block rather than individual character additions to the repertoire.
742	   The possibility of such "legacy" exceptions was one reason why the
743	   IDNA category rules include explicit provisions for exception lists
744	   (even though no such code points were identified prior to 2014).

746	3.3.3.  Unexpected Combining Sequences

748	   Most combining characters have the script property "Inherited" or
749	   "Common", i.e., are not members of any particular script and will not
750	   cause rules against mixed-script labels to be triggered.
751	   Normalization rules are generally structured around the base
752	   character, so unexpected combinations of base characters with
753	   combining ones may lead to cases where normalization might normally
754	   be expected to produce a precombined character but does not do so (in
755	   the most common situation because no such precombined character
756	   exists).  For example, the Latin script characters "a" and "a with
757	   acute accent" are both coded (as U+0061 and U+00E1).  If the latter
758	   is coded as the combining sequence U+0061 U+0301, NFC will turn that
759	   sequence into U+00E1 and everything will work as users expect.
760	   However, the Cyrillic "a" character (U+0430) is notoriously similar
761	   in appearance in most type styles to U+0061, and the sequence
762	   U+0430 U+0301 does not normalize to anything else.  Because there is
763	   no code point assigned for Cyrillic small letter a with acute accent
764	   and unlike many of the other examples in this document, that is
765	   Unicode working exactly as would be expected.  Whether it is an issue
766	   or not depends on the questions that are being asked and what rules
767	   are being applied.
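
	   The Latin and Cyrillic cases can be compared directly.  The
	   following sketch uses Python's standard unicodedata module; the
	   constant names are hypothetical.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"
LATIN_A = "\u0061"
CYRILLIC_A = "\u0430"


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


latin = nfc(LATIN_A + COMBINING_ACUTE)
cyrillic = nfc(CYRILLIC_A + COMBINING_ACUTE)

# The Latin sequence composes to the single code point U+00E1.
print(["U+%04X" % ord(ch) for ch in latin])
# The Cyrillic sequence is left as two code points.
print(["U+%04X" % ord(ch) for ch in cyrillic])
```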

769	3.3.4.  Examples and Cases from Other Scripts

771	   Research into these issues has not yet turned up a comprehensive list
772	   of affected scripts and code points.  As discussed elsewhere in this
773	   document, it is clear that Arabic and Latin Scripts are significantly
774	   affected, that some Han and Kangxi radicals and ideographs are
775	   affected, and that other examples do exist -- it is just not known
776	   how many of those examples there are and what patterns, if any,
777	   characterize them.

779	3.3.4.1.  Scripts with precomposed preferences and ones with combining
780	          preferences

782	   While the authors have been unable to find an explanation for the
783	   differentiation in the Unicode Standard, we have been told that there
784	   are differences among scripts as to whether the preferred action is
785	   to add new combining sequences only (and resist adding precomposed
786	   characters) as suggested in Section 3.3.2.3.1 or to add precomposed
787	   characters, often ones that do not have decompositions.  If those
788	   differences in preference do exist, it is probably important to have
789	   them documented so that they can be reflected in IDNA review
790	   procedures and elsewhere.  It will also require IETF discussion of
791	   whether combining sequences should be deprecated when the
792	   corresponding precomposed characters are added or disallowed
793	   entirely for those scripts (as has been implicitly suggested for
794	   Arabic language use [RFC5564]).

796	   [[CREF2: The above isn't quite right and probably needs additional
797	   discussion and text.]]

799	3.3.4.2.  The Han and Kangxi Cases

801	   [[CREF3: .. to be supplied .. ]]

803	3.4.  Confusion and the Casual User

805	   To the extent to which predictability for relatively casual users is
806	   a desired and important feature of relevant application or
807	   application support protocols, it is probably worth observing that
808	   the complex of rules and cases suggested or implied above is almost
809	   certainly too involved for the typical such user to develop a good
810	   intuitive understanding of how things behave and what relationships
811	   exist.  Conversely, the nature of writing systems for natural
812	   languages, especially those that have evolved and diverged over
813	   centuries, implies that no set of rules about allowable characters
814	   will guarantee complete safety (however that is defined).

816	4.  Implementation options and issues: Unicode properties, exceptions,
817	    and the nature of stability

819	4.1.  Unicode Stability compared to IETF (and ICANN) Stability

821	   The various stability rules in Unicode [Unicode70-Stability] all
822	   appear to be based on the model that once a value is assigned, it can
823	   never be changed.  That is probably appropriate for a character
824	   coding system with multiple uses and applications.  It is probably
825	   the only option when normative relationships are expressed in tables
826	   of values rather than by rules.  One consequence of such a model is
827	   that it is difficult or impossible to fix mistakes (for some
828	   stability rules, the Unicode Standard does provide for exceptions)
829	   and even harder to make adjustments that would normally be dictated
830	   by evolution.

832	   "No changes" provides a very strong and predictable type of
833	   stability.  There are many reasons to take that path.  As in some of
834	   the cases that motivated this document, the difficulty is that simply
835	   adding new code points (in Unicode) or features (in a protocol or
836	   application) may be destabilizing.  One then has complete stability
837	   for systems that never use or allow the new code points or features,
838	   but rough edges for newer systems that see the discrepancies.
839	   IDNA2003 (inadvertently) took that approach by freezing
840	   on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever
841	   been allowed, we would have had complete stability even as Unicode
842	   libraries changed.  Unicode has been quite ingenious about working
843	   around those difficulties with such provisions as having code points
844	   for newly-added precomposed characters decompose rather than altering
845	   the normalization for the combining sequences.  Other cases, such as
846	   newly-added precomposed characters that do not decompose for, e.g.,
847	   language or phonetic reasons, are more problematic.

849	   The IETF (and ICANN and standards development bodies such as ISO and
850	   ISO/IEC JTC1) have generally adopted a different type of stability
851	   model, one which considers experience in use and the ill effects of
852	   not making changes as well as the disruptive effects of doing so.  In
853	   the IETF model, if an earlier decision is causing sufficient harm and
854	   there is consensus in the communities that are most affected that a
855	   change is desirable enough to make transition costs acceptable, then
856	   the change is made.

858	   The difference and its implications are perhaps best illustrated by a
859	   disagreement when IDNA2008 was being approved.  IDNA2003 had
860	   effectively prevented some characters, notably (measured by intensity
861	   of the protests) the Sharp S character (U+00DF) from being used in
862	   DNS labels by mapping them to other characters before conversion to
863	   ACE form.  It had also prohibited some other code points, notably ZWJ
864	   (U+200D) and ZWNJ (U+200C), by discarding them.  In both cases, there
865	   were strong voices from the relevant language communities, supported
866	   by the registry communities, that the characters were important
867	   enough that it was more desirable to undergo the short-term pain of a
868	   transition and some uncertainty than to continue to exclude those
869	   characters and the IDNA2008 rules and repertoire are consistent with
870	   that preference.  The Unicode Consortium apparently believed that
871	   stability --elimination of any possibility of label invalidation or
872	   different interpretations of the same string-- was more important
873	   than those writing system requirements and community preferences.
874	   That view was expressed through what was effectively a fork in (or
875	   attempt to nullify) the IETF Standard [UTS46], a result that has
876	   probably been worse for the overall Internet than either of the
877	   possible decision choices.
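
	   For readers unfamiliar with the IDNA2003 behavior referred to
	   above, the sketch below reproduces only the mapping step, using
	   the Stringprep tables in Python's standard stringprep module.  The
	   function name idna2003_map() is hypothetical, and the later
	   Nameprep steps (NFKC, prohibited-character and bidi checks) are
	   deliberately omitted.

```python
import stringprep


def idna2003_map(label):
    """Apply only the Stringprep mapping step used by IDNA2003 Nameprep."""
    mapped = []
    for ch in label:
        # Table B.1: characters that are mapped to nothing (discarded).
        if stringprep.in_table_b1(ch):
            continue
        # Table B.2: case folding, which turns Sharp S into "ss".
        mapped.append(stringprep.map_table_b2(ch))
    return "".join(mapped)


# Sharp S (U+00DF) disappears from the label by being mapped.
print(idna2003_map("stra\u00DFe"))
# ZWNJ (U+200C) and ZWJ (U+200D) disappear by being discarded.
print(idna2003_map("a\u200Cb\u200Dc"))
```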

879	4.2.  New Unicode Properties

881	   One suggestion about the way out of these problems would be to create
882	   one or more new Unicode properties, maintained along with the rest of
883	   Unicode, and then incorporated into new or modified rules or
884	   categories in IDNA.  Given the analysis in this document, it appears
885	   that that property (or properties) would need to provide:

887	   1.  Identification of combining characters that, when used in
888	       combining sequences, do not produce decomposable characters.
889	       [[CREF4: Wording on the above is not quite right but, for the
890	       present, maybe the intent is clear.]]

892	   2.  Identification of precomposed characters that might reasonably be
893	       expected to decompose, but that do not.

895	   3.  Identification of character forms that are distinct only because
896	       of language or phonetic distinctions within a script.

898	   4.  Identification of scripts for which precomposed forms are
899	       strongly preferred and combining sequences should either be
900	       viewed as temporary mechanisms until precomposed characters are
901	       assigned or banned entirely.

903	   5.  Identification of code points that represent symbols for
904	       specific, non-language, purposes even if identified as letters or
905	       numerals by their General Property.  This would include all
906	       characters given separate code points because of specialized
907	       "mathematical" and "phonetic" characters (see Section 3.3.2.2 and
908	       Section 3.3.2.1), but there are probably additional cases.

910	   Some of these properties (or characteristics or values of a single
911	   property) would be suitable for disallowing characters, code points,
912	   or contextual sequences that otherwise might be allowed by IDNA.
913	   Others would be more suitable for making equality comparisons come
914	   out as needed by IDNA, particularly to eliminate distinctions based
915	   on language context.
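
	   Until such properties exist, candidates for item 2 can only be
	   approximated.  The sketch below is a rough heuristic, not a
	   proposed algorithm: it matches on character names, which
	   Section 3.3.2.3 notes are not a reliable guide, and the function
	   name, the name phrase, and the code point range are all
	   assumptions made for illustration.  Its output depends on the
	   Unicode version bundled with the Python runtime.

```python
import unicodedata


def non_decomposing_with(phrase, first, last):
    """List (code point, name) pairs whose name contains phrase and
    that have no decomposition mapping of any kind."""
    hits = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if phrase in name and not unicodedata.decomposition(ch):
            hits.append((cp, name))
    return hits


# Heuristic scan of U+0600..U+08FF for "... WITH HAMZA ABOVE" letters.
candidates = non_decomposing_with("WITH HAMZA ABOVE", 0x0600, 0x08FF)
for cp, name in candidates:
    print("U+%04X %s" % (cp, name))
```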

917	   While it would appear that appropriate rules and categories could be
918	   developed for IDNA (and, presumably, for PRECIS, etc.) if the problem
919	   areas are those identified in this document, it is not yet known
920	   whether the list is complete (and, hence, whether additional
921	   properties or information would be needed).

923	   Even with such properties, IDNA would still almost certainly need
924	   exception lists.  In addition, it is likely that stability rules for
925	   those properties would need to reflect IETF norms with arrangements
926	   for bringing the IETF and other communities into the discussion when
927	   tradeoffs are reviewed.

929	4.3.  The need for exception lists

931	   [[CREF5: Note in draft: this section is a partial placeholder and may
932	   need more elaboration.]]
933	   Issues with exception lists and the requirements for them are
934	   discussed in Section 2 above and in RFC 5894 [RFC5894].

936	5.  Proposed/Alternative Changes to RFC 5892 for the issues first
937	    exposed by new code point U+08A1

939	   NOTE IN DRAFT: See the comments in the Introduction, Section 1 and
940	   the first paragraph of each Subsection below for the status of the
941	   Subsections that follow.  Each one, in combination with the material
942	   in Section 3 above, also provides information about the reasons why
943	   that particular strategy might or might not be appropriate.

945	   When the term "Category" followed by an upper-case letter appears
946	   below, it is a reference to a rule in RFC 5892.

948	5.1.  Disallow This New Code Point

950	   This option is almost certainly too Arabic-specific and does not
951	   solve, or even address, the underlying problem.  It also does not
952	   inherently generalize to non-decomposing precomposed code points that
953	   might be added in the future (whether to Arabic or other scripts)
954	   even though one could add more code points to Category F in the same
955	   way.

957	   If chosen by the community, this subsection would update the portion
958	   of the IDNA2008 specification that identifies rules for what
959	   characters are permitted [RFC5892] to disallow that code point.

961	   With the publication of this document, Section 2.6 ("Exceptions (F)")
962	   of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
963	   Category F so that the rule itself reads:

965	      F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
966	                   0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
967	                   0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
968	                   06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
969	                   3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
970	                   303B, 30FB}

972	   and then add to the subtable designated
973	   "DISALLOWED -- Would otherwise have been PVALID"
974	   after the line that begins "07FA", the additional line:

976	      08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE

978	   This has the effect of making the cited code point DISALLOWED
979	   independent of application of the rest of the IDNA rule set to the
980	   current version of Unicode.  Those wishing to create domain name
981	   labels containing Beh with Hamza Above may continue to use the
982	   sequence

984	      U+0628, ARABIC LETTER BEH
985	      followed by

987	      U+0654, ARABIC HAMZA ABOVE

989	   which was valid for IDNA purposes in Unicode 5.0 and earlier and
990	   which continues to be valid.
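
	   For implementers, the updated rule amounts to a set-membership
	   test.  The following is a minimal sketch, assuming an
	   implementation that keeps Category F as a simple set; the names
	   CATEGORY_F and in_category_f() are hypothetical.  It covers only
	   the membership test, not the per-code-point derived property
	   values listed in the RFC 5892 subtables.

```python
# Category F code points as listed above, including the added 08A1.
CATEGORY_F = frozenset({
    0x00B7, 0x00DF, 0x0375, 0x03C2, 0x05F3, 0x05F4, 0x0640, 0x0660,
    0x0661, 0x0662, 0x0663, 0x0664, 0x0665, 0x0666, 0x0667, 0x0668,
    0x0669, 0x06F0, 0x06F1, 0x06F2, 0x06F3, 0x06F4, 0x06F5, 0x06F6,
    0x06F7, 0x06F8, 0x06F9, 0x06FD, 0x06FE, 0x07FA, 0x08A1, 0x0F0B,
    0x3007, 0x302E, 0x302F, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
    0x303B, 0x30FB,
})


def in_category_f(cp):
    """Return True if the code point is listed in Category F."""
    return cp in CATEGORY_F


# The precomposed character is caught by the exception rule ...
print(in_category_f(0x08A1))
# ... while the code points of the combining sequence are not.
print(in_category_f(0x0628), in_category_f(0x0654))
```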

992	   In principle, much the same thing could be accomplished by using the
993	   IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892
994	   Section 5.3).  However, that category is described as applying only
995	   when "property values in versions of Unicode after 5.2 have changed
996	   in such a way that the derived property value would no longer be
997	   PVALID or DISALLOWED".  Because U+08A1 is a newly-added code point in
998	   Unicode 7.0.0 and no property values of code points in prior versions
999	   have changed, category G does not apply.  If that section of RFC 5892
1000	   were to be replaced in the future, perhaps consideration should be
1001	   given to adding Normalization Stability and other issues to that
1002	   description but, at present, it is not relevant.

1004	5.2.  Disallow This New Code Point and All Future Precomposed Additions
1005	      that Do Not Decompose

1007	   At least in principle, the approach suggested above (Section 5.1)
1008	   could be expanded to disallow all future allocations of non-
1009	   decomposing precomposed characters.  This would probably require
1010	   a new Unicode property to identify such characters and/or more
1011	   emphasis on the manual, individual code point, checking of the new
1012	   Unicode version review process (i.e., not just application of the
1013	   existing rules and algorithm).  It might require either a new rule in
1014	   IDNA or a modification to the structure of Category F to make
1015	   additions less tedious.  It would do nothing for different ways to
1016	   form identical characters within the same script that were not
1017	   associated with decomposition and so would have to be used in
1018	   conjunction with other approaches.  Finally, for scripts (such as
1019	   Arabic) where there is a very strong preference to avoid combining
1020	   sequences, this approach would exclude exactly the wrong set of
1021	   characters.

1023	5.3.  Disallow the combining sequences for these characters

1025	   As in the approach discussed in Section 5.1, this approach is too
1026	   Arabic-specific to address the more general problem.  However, it
1027	   illustrates a single-script approach and a possible mechanism for
1028	   excluding combining sequences whose handling is connected to language
1029	   information (information that, as discussed above, is not relevant to
1030	   the DNS).

1032	   If chosen by the community, this subsection would update the portion
1033	   of the IDNA2008 specification that identifies contextual rules
1034	   [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction
1035	   with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631).  Note that
1036	   the choice of this option is consistent with the general preference
1037	   for precomposed characters discussed above but would ban some labels
1038	   that are valid today and that might, in principle, be in use.

1040	   The required prohibition could be imposed by creating a new
1041	   contextual rule in RFC 5892 to constrain combining sequences
1042	   containing Hamza Above.
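
   One possible shape for such a check is sketched below.  It assumes
   that "in conjunction with" means that Hamza Above immediately
   follows the base character; sequences with intervening combining
   marks are not handled.  The function name is hypothetical.

```python
HAMZA_ABOVE = "\u0654"
RESTRICTED_BASES = {"\u0628", "\u062D", "\u0631"}  # BEH, HAH, REH


def hamza_above_rule(label: str) -> bool:
    """Return True if no Hamza Above directly follows BEH, HAH, or REH.

    Assumption: only the immediately preceding code point is examined.
    """
    for i, ch in enumerate(label):
        if ch == HAMZA_ABOVE and i > 0 and label[i - 1] in RESTRICTED_BASES:
            return False
    return True
```

   A label containing U+0628 U+0654 would fail this check, while one
   containing U+0627 U+0654 or the precomposed U+08A1 would pass.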

1044	   As the Unicode Standard points out at some length [Unicode70-Arabic],
1045	   Hamza is a problematic abstract character and the "Hamza Above"
1046	   construction even more so.  IDNA has historically associated
1047	   characters whose use is reasonable in some contexts but not others
1048	   with the special derived property "CONTEXTO" and then specified
1049	   specific, context-dependent, rules about where they may be used.
1050	   Because Hamza Above is problematic (and spawns edge cases, as
1051	   discussed in the Unicode Standard section cited above), it was
1052	   suggested that a contextual rule might be appropriate.  There are at
1053	   least two reasons why a contextual rule would not be suitable for the
1054	   present situation.

1056	   1.  As discussed above, the present situation is a normalization
1057	       stability and predictability problem, not a contextual one.  Had
1058	       the same issues arisen with a newly-added precomposed character
1059	       that could previously be constructed from non-problematic base
1060	       and combining characters, it would be even more clearly a
1061	       normalization issue and, following the principles discussed there
1062	       and particularly in UAX 15 [UAX15-Exclusion], might not have been
1063	       assigned at all.

1065	   2.  The contextual rule sets are designed around restricting the use
1066	       of code points to a particular script or adjacent to particular
1067	       characters within that script.  Neither of these cases applies to
1068	       the newly-added character even if one could imagine rules for the
1069	       use of Hamza Above (U+0654) that would reflect the considerations
1070	       of Chapter 8 of Unicode 6.2.  Even had the latter been desired,
1071	       it would be somewhat late now -- Hamza Above has been present as
1072	       a combining character (U+0654) in many versions of Unicode.
1073	       While that section of the Unicode Standard describes the issues,
1074	       it does not provide actionable guidance about what to do about it
1075	       for cases going forward or when visual identity is important.

1077	5.4.  Use Combining Classes to Develop Additional Contextual Rules

1079	   This option may not be of any practical use, but Unicode supports a
1080	   property called "Combining_Class".  That property has been used in
1081	   IDNA only to construct a contextual rule for Zero-Width Non-Joiner
1082	   [RFC5892, Appendix A.1] but speculation has arisen during discussions
1083	   of work on Arabic combining characters and rendering [UTR53] as to
1084	   whether Combining Classes could be used to build additional
1085	   contextual rules that would restrict problematic cases.  Unless such
1086	   rules were applied only to new code points, they would also not be
1087	   backward compatible.

1089	   The question of whether Combining Classes could be used to reduce the
1090	   number of problematic labels is at least worth examination.
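
   As a starting point for such an examination, combining class values
   can be inspected directly.  The sketch below assumes Python's
   standard "unicodedata" module, whose combining() function returns
   the canonical combining class of a character; the helper name is
   illustrative.

```python
import unicodedata


def combining_classes(text: str):
    """List each code point in text with its canonical combining class."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# BEH is a base character (class 0); Hamza Above is a mark placed
# above the base (class 230).
classes = combining_classes("\u0628\u0654")
```

   Whether values such as these are sufficient to distinguish
   problematic sequences from harmless ones is exactly the open
   question noted above.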

1092	5.5.  Disallow all Combining Characters for Specific Scripts

1094	   [[CREF6: This subsection needs to be turned into prose, but the
1095	   following bullet points are probably sufficient to identify the
1096	   issues.]]
1097	   o  Might work for Arabic and other "precomposed preference" scripts
1098	      if those can be identified in an orderly and stable way (see
1099	      Section 3.3.4.1; recommended by the Arabic language community for
1100	      IDNs [RFC5564]).

1102	   o  Unworkable for Latin because many characters that do not decompose
1103	      are, at least in part, historical accidents resulting from
1104	      combining prior national standards (this probably holds for
1105	      other scripts as well).

1107	   o  No effect at all on special-use representations of identical
1108	      characters within a script (see Section 3.3.2.1 and
1109	      Section 3.3.2.2).

1111	   o  Not backwards compatible.

1113	5.6.  Do Nothing Other Than Warn

1115	   A recommendation from UTC and others has been to simply warn
1116	   registries, at all levels of the tree, to be careful with this set of
1117	   characters.  Doing that well would probably require making language
1118	   distinctions within zones, which would violate the important IDNA
1119	   principles that labels are not necessarily "words", do not carry
1120	   language information, and may, at the protocol level, even
1121	   deliberately mix languages and scripts.  It is also problematic
1122	   because the relevant set of characters is not easily defined in a
1123	   precise way.  This suggestion is problematic because the DNS and IDNA
1124	   cannot make or enforce language distinctions, but it would avoid
1125	   having the IETF either invalidate label strings that are potentially
1126	   now in use or create inconsistencies among the characters that
1127	   combine with selected base characters but that also have precomposed
1128	   forms that do not have decompositions.  The potential would still
1129	   exist for registries to respect the warning and deprecate such labels
1130	   if they existed.

1132	   More generally, while there are already requirements in IDNA for
1133	   registries to be knowledgeable and responsible about the labels they
1134	   register (a separate document discusses that requirement
1135	   [Klensin-rfc5891bis]), experience indicates that those requirements
1136	   are often ignored.  At least as important, warning registries about
1137	   what should or should not be registered and even calling out specific
1138	   code points as dangerous and in need of extra attention
1139	   [Freytag-dangerous] does nothing to address the many cases in which
1140	   lookup-time checking for IDNA conformance and deliberately misleading
1141	   label constructions is important.

1143	5.7.  Normalization Form IETF (NFI)

1145	   The most radical possibility for the comparison issue would be to
1146	   decide that none of the Unicode Normalization Forms specified in UAX
1147	   15 [UAX15] are adequate for use with the DNS because, contrary to
1148	   their apparent descriptions, normalization tables are actually
1149	   determined using language information.  However, use of language
1150	   information is unacceptable for IDNA for reasons described elsewhere
1151	   in this document.  The remedy would be to define an IETF-specific (or
1152	   DNS-specific) normalization form (sometimes called "NFI" in
1153	   discussions), building on NFC but adhering strictly to the rule that
1154	   normalization causes two different forms of the same character (glyph
1155	   image) within the same script to be treated as equal.  In practice
1156	   such a form could be implemented for IDNA purposes as an additional
1157	   rule within RFC 5892 (and its successors) that constituted an
1158	   exception list for the NFC tables.  For this set of characters, the
1159	   special IETF normalization form would be equivalent to the exclusion
1160	   discussed in Section 5.3 above.

1162	   An Internet-identifier-specific normalization form, especially if
1163	   specified somewhat separately from the IDNA core, would have a small
1164	   marginal advantage over the other strategies in this section (or in
1165	   combination with some of them), even though most of the end result
1166	   and much of the implementation would be the same in practice.  While
1167	   the design of IDNA requires that strings be normalized as part of the
1168	   process of determining label validity (and hence before either
1169	   storage of values in the DNS or name resolution), there is an ongoing
1170	   debate about whether normalization should be performed before storing
1171	   a string or putting it on the wire or only when the string is
1172	   actually compared or otherwise used.

1174	   If a normalization procedure with the right properties for the IETF
1175	   were defined, that argument could be bypassed and the best decisions
1176	   made for different circumstances.  The separation would also allow
1177	   better comparison of strings that lack language context in
1178	   applications environments in which the additional processing and
1179	   character classifications of IDNA and/or PRECIS were not applicable.
1180	   Having such a normalization procedure defined outside IDNA would also
1181	   minimize changes to IDNA itself, which is probably an advantage.

1183	   If the new normalization form were, in practice, simply an overlay on
1184	   NFC with modifications dictated by exception and/or property lists,
1185	   keeping its definition separate from IDNA would also avoid
1186	   interweaving those exceptions and property lists with the rules and
1187	   categories of IDNA itself, avoiding some unnecessary complexity.
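
   The "overlay on NFC" idea can be sketched as follows.  This is a
   sketch under stated assumptions, not a definition of NFI: the
   exception list holds a single entry, the direction of the mapping
   (combining sequence to precomposed form) is an assumption made for
   the example, and the names NFI_EXCEPTIONS and nfi are hypothetical.

```python
import unicodedata

# Hypothetical exception list: sequences that NFC leaves alone but that
# this overlay treats as equal to a precomposed character.
NFI_EXCEPTIONS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE -> BEH WITH HAMZA ABOVE
}


def nfi(text: str) -> str:
    """Apply NFC, then the exception list (illustrative overlay only)."""
    normalized = unicodedata.normalize("NFC", text)
    for sequence, replacement in NFI_EXCEPTIONS.items():
        normalized = normalized.replace(sequence, replacement)
    return normalized
```

   Because the exception list is data held apart from the
   normalization step, it could be maintained separately from the IDNA
   rules and categories, as suggested above.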

1189	6.  Editorial clarification to RFC 5892

1191	   Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
1192	   clarification to Appendix A and Section A.1 of RFC 5892.  This
1193	   section of this document updates the RFC to apply that clarification.

1195	   1.  In Appendix A, add a new paragraph after the paragraph that
1196	       begins "The code point...".  The new paragraph should read:

1198	       "For the rule to be evaluated to True for the label, it MUST be
1199	       evaluated separately for every occurrence of the Code point in
1200	       the label; each of those evaluations must result in True."

1202	   2.  In Appendix A, Section A.1, replace the "Rule Set" by

1204	      Rule Set:
1205	        False;
1206	        If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
1207	        If cp .eq. \u200C And
1208	               RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
1209	          (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
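
   The combined effect of the two changes, evaluating the rule at every
   occurrence of the code point and requiring each evaluation to be
   True, can be sketched as below.  Python's "unicodedata" module does
   not expose Joining_Type, so the sketch takes a caller-supplied
   joining_type function; that parameter and both function names are
   assumptions made for illustration.

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value for Virama


def zwnj_rule_at(label, i, joining_type):
    """Evaluate the rule set for the ZWNJ at index i of label."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # (Joining_Type:{L,D})(Joining_Type:T)*cp
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # cp(Joining_Type:T)*(Joining_Type:{R,D})
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def zwnj_rule_for_label(label, joining_type):
    """True only if the rule holds at every ZWNJ in the label."""
    return all(
        zwnj_rule_at(label, i, joining_type)
        for i, ch in enumerate(label)
        if ch == ZWNJ
    )
```

   A label with two occurrences of U+200C, only one of which satisfies
   the rule, evaluates to False under this sketch.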

1211	7.  Acknowledgements

1213	   The Unicode 7.0.0 changes were extensively discussed within the IAB's
1214	   Internationalization Program.  The authors are grateful for the
1215	   discussions and feedback there, especially from Andrew Sullivan and
1216	   David Thaler.  Additional information was requested and received from
1217	   Mark Davis and Ken Whistler and while they probably do not agree with
1218	   the necessity of excluding this code point or taking even more
1219	   drastic action as their responsibility is to look at the Unicode
1220	   Consortium requirements for stability, the decision would not have
1221	   been possible without their input.  Thanks to Bill McQuillan and Ted
1222	   Hardie for reading versions of the document carefully enough to
1223	   identify and report some confusing typographical errors.  Several
1224	   experts and reviewers who prefer to remain anonymous also provided
1225	   helpful input and comments on preliminary versions of this document.

1227	8.  IANA Considerations

1229	   When the IANA registry and tables are updated to reflect Unicode
1230	   7.0.0, changes should be made according to the decisions the IETF
1231	   makes about Section 5.

1233	9.  Security Considerations

1235	   From at least one point of view, this document is entirely a
1236	   discussion of a security issue or set of such issues.  The
1237	   "similar-looking characters" issue has been a concern since the
1238	   earliest days of IDNs [HomographAttack] and has driven assorted
1239	   "character confusion" projects [ICANN-VIP].  However, if a user
1240	   types in a string on one device and can get different results that
1241	   do not compare equal when it is typed on a different device (with
1242	   both behaving correctly and both keyboards appearing to be the same
1243	   and for the same script) then all security mechanisms that depend
1244	   on the underlying identifiers, including the practical applications
1245	   of DNS response integrity checks via DNSSEC [RFC4033] and
1246	   DNS-embedded public key mechanisms [RFC6698], are at risk if
1247	   different parties, at least one of them malicious, obtain or register
1248	   some of the identical-appearing and identically-typed strings and
1249	   get them into appropriate zones.

1251	   Mechanisms that depend on trusting registration systems (e.g.,
1252	   registries and registrars in the DNS IDN case, see Section 5.6 above)
1253	   are likely to be of only limited utility because fully-qualified
1254	   domains that may be perfectly reasonable at the first level or two of
1255	   the DNS may have differences of this type deep in the tree, into
1256	   levels where name management, and often accountability, are weak.
1257	   Similar issues obviously apply when names are user-selected or
1258	   unmanaged.

1260	   When the issue is not a deliberate attack but simple accidental
1261	   confusion among similar strings, most of our strategies depend on the
1262	   acceptability of false negatives on matching if there is low risk of
1263	   false positives (see, for example, the discussion of false negatives
1264	   in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]).
1265	   Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and
1266	   the PRECIS effort [RFC8264].  However, because the cases covered here
1267	   are connected, not just to what the user sees but to what is typed
1268	   and where, there is an increased risk of false positives (accidental
1269	   as well as deliberate).

1271	   [[CREF7: Note in Draft: The paragraph that follows was written for a
1272	   much earlier version of this document.  It is obsolete, but is being
1273	   retained as a placeholder for future developments.]]

1275	   This specification excludes a code point for which the Unicode-
1276	   specified normalization behavior could result in two ways to form a
1277	   visually-identical character within the same script not comparing
1278	   equal.  That behavior could create a dream case for someone intending
1279	   to confuse the user by use of a domain name that looked identical to
1280	   another one, was entirely in the same script, but was still
1281	   considered different.

1283	   Internet Security in areas that involve internationalized identifiers
1284	   that might contain the relevant characters is therefore significantly
1285	   dependent on some effective resolution for the issues identified in
1286	   this document, not just hand waving, devout wishes, or appointment of
1287	   study committees about it.

1289	10.  References

1291	10.1.  Normative References

1293	   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
1294	              BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008,
1295	              <https://www.rfc-editor.org/info/rfc5137>.

1297	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1298	              Applications (IDNA): Definitions and Document Framework",
1299	              RFC 5890, DOI 10.17487/RFC5890, August 2010,
1300	              <https://www.rfc-editor.org/info/rfc5890>.

1302	   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
1303	              Internationalized Domain Names for Applications (IDNA)",
1304	              RFC 5892, DOI 10.17487/RFC5892, August 2010,
1305	              <https://www.rfc-editor.org/info/rfc5892>.

1307	   [RFC5892Erratum]
1308	              "RFC5892, "The Unicode Code Points and Internationalized
1309	              Domain Names for Applications (IDNA)", August 2010, Errata
1310	              ID: 3312", Errata ID 3312, August 2012,
1311	              <http://www.rfc-editor.org/errata_search.php?rfc=5892>.

1313	   [RFC5894]  Klensin, J., "Internationalized Domain Names for
1314	              Applications (IDNA): Background, Explanation, and
1315	              Rationale", RFC 5894, DOI 10.17487/RFC5894, August 2010,
1316	              <https://www.rfc-editor.org/info/rfc5894>.

1318	   [RFC6943]  Thaler, D., Ed., "Issues in Identifier Comparison for
1319	              Security Purposes", RFC 6943, DOI 10.17487/RFC6943, May
1320	              2013, <https://www.rfc-editor.org/info/rfc6943>.

1322	   [RFC8264]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
1323	              Preparation, Enforcement, and Comparison of
1324	              Internationalized Strings in Application Protocols",
1325	              RFC 8264, DOI 10.17487/RFC8264, October 2017,
1326	              <https://www.rfc-editor.org/info/rfc8264>.

1328	   [UAX15]    Davis, M., Ed., "Unicode Standard Annex #15: Unicode
1329	              Normalization Forms", June 2014,
1330	              <http://www.unicode.org/reports/tr15/>.

1332	   [UAX15-Exclusion]
1333	              "Unicode Standard Annex #15, op. cit., Section 5",
1334	              <http://www.unicode.org/reports/
1335	              tr15/#Primary_Exclusion_List_Table>.

1337	   [UAX15-Versioning]
1338	              "Unicode Standard Annex #15, op. cit., Section 3",
1339	              <http://www.unicode.org/reports/tr15/#Versioning>.

1341	   [Unicode5]
1342	              The Unicode Consortium, "The Unicode Standard, Version
1343	              5.0", ISBN 0-321-48091-0, 2007.

1345	              Boston, MA, USA: Addison-Wesley.  ISBN 0-321-48091-0.
1346	              This printed reference has now been updated online to
1347	              reflect additional code points.  For code points, the
1348	              reference at the time RFC 5890-5894 were published is to
1349	              Unicode 5.2.

1351	   [Unicode62]
1352	              The Unicode Consortium, "The Unicode Standard, Version
1353	              6.2.0", ISBN 978-1-936213-07-8, 2012,
1354	              <http://www.unicode.org/versions/Unicode6.2.0/>.

1356	              Preferred citation: The Unicode Consortium.  The Unicode
1357	              Standard, Version 6.2.0, (Mountain View, CA: The Unicode
1358	              Consortium, 2012.  ISBN 978-1-936213-07-8)

1360	   [Unicode7]
1361	              The Unicode Consortium, "The Unicode Standard, Version
1362	              7.0.0", ISBN 978-1-936213-09-2, 2014,
1363	              <http://www.unicode.org/versions/Unicode7.0.0/>.

1365	              Preferred Citation: The Unicode Consortium.  The Unicode
1366	              Standard, Version 7.0.0, (Mountain View, CA: The Unicode
1367	              Consortium, 2014.  ISBN 978-1-936213-09-2)

1369	   [Unicode70-Arabic]
1370	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1371	              9.2: Arabic", Chapter 9, 2014,
1372	              <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.

1374	              Subsection titled "Encoding Principles", paragraph
1375	              numbered 4, starting on page 362.

1377	   [Unicode70-CompatDecomp]
1378	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1379	              2.3: Compatibility Characters", Chapter 2, 2014,
1380	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1382	              Subsection titled "Compatibility Decomposable Characters"
1383	              starting on page 26.

1385	   [Unicode70-Design]
1386	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1387	              2.2: Unicode Design Principles", Chapter 2, 2014,
1388	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1390	   [Unicode70-Hamza]
1391	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1392	              9.2: Arabic", Chapter 9, 2014,
1393	              <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.

1395	              Subsection titled "Combining Hamza Above" starting on page
1396	              378.

1398	   [Unicode70-Overlay]
1399	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1400	              2.2: Unicode Design Principles", Chapter 2, 2014,
1401	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1403	              Subsection titled "Non-decomposition of Overlaid
1404	              Diacritics" starting on page 64.

1406	   [Unicode70-Stability]
1407	              "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1408	              2.2: Unicode Design Principles", Chapter 2, 2014,
1409	              <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.

1411	              Subsection titled "Stability" starting on page 23 and
1412	              containing a link to http://www.unicode.org/policies/
1413	              stability_policy.html.

1415	   [UTS46]    Davis, M. and M. Suignard, "Unicode Technical Standard
1416	              #46: Unicode IDNA Compatibility Processing",
1417	              Version 7.0.0, June 2014,
1418	              <http://unicode.org/reports/tr46/>.

1420	10.2.  Informative References

1422	   [Dalby]    Dalby, A., "Dictionary of Languages: The definitive
1423	              reference to more than 400 languages", Columbia University
1424	              Press, 2004.

1426	              pages 206-207

1428	   [Daniels]  Daniels, P. and W. Bright, "The World's Writing Systems",
1429	              Oxford University Press, 1986.

1431	   [Freytag-dangerous]
1432	              Freytag, A., Klensin, J., and A. Sullivan, "Those
1433	              Troublesome Characters: A Registry of Unicode Code Points
1434	              Needing Special Consideration When Used in Network
1435	              Identifiers", June 2017,
1436	              <https://datatracker.ietf.org/doc/
1437	              draft-freytag-troublesome-characters/>.

1439	   [HomographAttack]
1440	              Gabrilovich, E. and A. Gontmakher, "The Homograph Attack",
1441	              Communications of the ACM 45(2):128, February 2002,
1442	              <http://www.cs.technion.ac.il/~gabr/papers/
1443	              homograph_full.pdf>.

1445	   [ICANN-VIP]
1446	              ICANN, "The IDN Variant Issues Project: A Study of Issues
1447	              Related to the Management of IDN Variant TLDs (Integrated
1448	              Issues Report)", February 2012,
1449	              <https://www.icann.org/en/system/files/files/
1450	              idn-vip-integrated-issues-final-clean-20feb12-en.pdf>.

1452	   [Klensin-rfc5891bis]
1453	              Klensin, J., "Internationalized Domain Names in
1454	              Applications (IDNA): Registry Restrictions and
1455	              Recommendations", September 2017,
1456	              <https://datatracker.ietf.org/doc/
1457	              draft-klensin-idna-rfc5891bis/>.

1459	   [Omniglot-Fula]
1460	              Ager, S., "Omniglot: Fula (Fulfulde, Pulaar,
1461	              Pular'Fulaare)",
1462	              <http://www.omniglot.com/writing/fula.htm>.

1464	              Captured 2015-01-07

1466	   [RFC0020]  Cerf, V., "ASCII format for network interchange", STD 80,
1467	              RFC 20, DOI 10.17487/RFC0020, October 1969,
1468	              <https://www.rfc-editor.org/info/rfc20>.

1470	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
1471	              "Internationalizing Domain Names in Applications (IDNA)",
1472	              RFC 3490, DOI 10.17487/RFC3490, March 2003,
1473	              <https://www.rfc-editor.org/info/rfc3490>.

1475	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
1476	              for Internationalized Domain Names in Applications
1477	              (IDNA)", RFC 3492, DOI 10.17487/RFC3492, March 2003,
1478	              <https://www.rfc-editor.org/info/rfc3492>.

1480	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1481	              Resource Identifier (URI): Generic Syntax", STD 66,
1482	              RFC 3986, DOI 10.17487/RFC3986, January 2005,
1483	              <https://www.rfc-editor.org/info/rfc3986>.

1485	   [RFC4033]  Arends, R., Austein, R., Larson, M., Massey, D., and S.
1486	              Rose, "DNS Security Introduction and Requirements",
1487	              RFC 4033, DOI 10.17487/RFC4033, March 2005,
1488	              <https://www.rfc-editor.org/info/rfc4033>.

1490	   [RFC5564]  El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
1491	              "Linguistic Guidelines for the Use of the Arabic Language
1492	              in Internet Domains", RFC 5564, DOI 10.17487/RFC5564,
1493	              February 2010, <https://www.rfc-editor.org/info/rfc5564>.

1495	   [RFC6452]  Faltstrom, P., Ed. and P. Hoffman, Ed., "The Unicode Code
1496	              Points and Internationalized Domain Names for Applications
1497	              (IDNA) - Unicode 6.0", RFC 6452, DOI 10.17487/RFC6452,
1498	              November 2011, <https://www.rfc-editor.org/info/rfc6452>.

1500	   [RFC6698]  Hoffman, P. and J. Schlyter, "The DNS-Based Authentication
1501	              of Named Entities (DANE) Transport Layer Security (TLS)
1502	              Protocol: TLSA", RFC 6698, DOI 10.17487/RFC6698, August
1503	              2012, <https://www.rfc-editor.org/info/rfc6698>.

1505	   [Unicode32]
1506	              The Unicode Consortium, "The Unicode Standard, Version
1507	              3.2.0".

1509	              The Unicode Standard, Version 3.2.0 is defined by The
1510	              Unicode Standard, Version 3.0 (Reading, MA, Addison-
1511	              Wesley, 2000.  ISBN 0-201-61633-5), as amended by the
1512	              Unicode Standard Annex #27: Unicode 3.1
1513	              (http://www.unicode.org/reports/tr27/) and by the Unicode
1514	              Standard Annex #28: Unicode 3.2
1515	              (http://www.unicode.org/reports/tr28/).

1517	   [UTR53]    Unicode Consortium, "Proposed Draft: Unicode Technical
1518	              Report #53: Unicode Arabic Mark Ordering Algorithm",
1519	              August 2017, <http://www.unicode.org/reports/tr53/>.

1521	              Note: this is a Proposed Draft, out for public review when
1522	              this version of the current I-D is posted, and should not
1523	              be considered either an approved/ final document or a
1524	              stable reference.

1526	Appendix A.  Change Log

1528	   RFC Editor: Please remove this appendix before publication.

1530	A.1.  Changes from version -00 (2014-07-21) to -01

1532	   o  Version 01 of this document is an extensive rewrite and
1533	      reorganization, reflecting discussions with UTC members and adding
1534	      three more options for discussion to the original proposal to
1535	      simply disallow the new code point.

1537	A.2.  Changes from version -01 (2014-12-07) to -02

1539	   Corrected a typographical error in which Hamza Above was incorrectly
1540	   listed with the wrong code point.

1542	A.3.  Changes from version -02 (2014-12-07) to -03

1544	   Corrected a typographical error in the Abstract in which RFC 5892 was
1545	   incorrectly shown as 5982.

1547	A.4.  Changes from version -03 (2015-01-06) to -04

1549	   o  Explicitly identified the applicability of U+08A1 with Fula and
1550	      added references that discuss that language and how it is written.

1552	   o  Updated several Unicode 6.2 references to point to Unicode 7.0
1553	      since the latter is now available in stable form (it was done when
1554	      work on this I-D started).

1556	   o  Extensively revised to discuss the non-Arabic cases, non-
1557	      decomposing diacritics, other types of characters that don't
1558	      compare equal after normalization, and more general problem and
1559	      approaches.

1561	A.5.  Changes from version -04 (2015-03-11) to -05

1563	   o  Modified a few citation labels to make them more obvious.

1565	   o  Restructured Section 1 and added additional terminology comments.

1567	   o  Added discussion about non-decomposable character cases, including
1568	      the "slash" example, and associated references for which -04
1569	      contained only placeholders.

1571	   o  The examples and discussion of Latin script issues have been
1572	      expanded considerably.  It is unfortunate that many readers in the
1573	      IETF community apparently cannot understand examples well enough
1574	      to believe a problem is significant unless there is a discussion
1575	      of Latin script examples, but, at least for this working draft,
1576	      that is the way it is.

1578	   o  Rewrote the discussion of several of the alternatives and added
1579	      the discussion of combining classes.

1581	   o  Rewrote and extended the discussion of the "warn only"
1582	      alternative.

1584	   o  Several other sections modified to improve technical or editorial
1585	      clarity.

1587	   o  Note that, while some references have been updated, others have
1588	      not.  In particular, Unicode references are still tied to versions
1589	      6 or 7.  In some cases, those non-historical references are and
1590	      will remain appropriate; others will best be replaced with
1591	      information about current versions of documents.

1593	Authors' Addresses

1595	   John C Klensin
1596	   1770 Massachusetts Ave, Ste 322
1597	   Cambridge, MA  02140
1598	   USA

1600	   Phone: +1 617 245 1457
1601	   Email: john-ietf@jck.com
1602	   Patrik Faltstrom
1603	   Netnod
1604	   Franzengatan 5
1605	   Stockholm  112 51
1606	   Sweden

1608	   Phone: +46 70 6059051
1609	   Email: paf@netnod.se