idnits 2.17.1
draft-klensin-idna-5892upd-unicode70-04.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== It seems as if not all pages are separated by form feeds - found 28 form
feeds but 744 pages
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
** The document seems to lack a both a reference to RFC 2119 and the
recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
keywords.
RFC 2119 keyword, line 1033: '...ated to True for the label, it MUST be...'
-- The draft header indicates that this document updates RFC5892, but the
abstract doesn't seem to directly say this. It does mention RFC5892
though, so this could be OK.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust and authors Copyright Line does not
match the current year
(Using the creation date from RFC5892, updated by this document, for
RFC5378 checks: 2008-04-26)
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (March 10, 2015) is 3335 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
-- Possible downref: Non-RFC (?) normative reference: ref.
'PRECIS-Framework'
-- Duplicate reference: RFC5892, mentioned in 'RFC5892Erratum', was also
mentioned in 'RFC5892'.
** Downref: Normative reference to an Informational RFC: RFC 5894
** Downref: Normative reference to an Informational RFC: RFC 6943
-- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15'
-- Possible downref: Non-RFC (?) normative reference: ref. 'UAX15-Exclusion'
-- Possible downref: Non-RFC (?) normative reference: ref.
'UAX15-Versioning'
-- Possible downref: Non-RFC (?) normative reference: ref. 'UTS46'
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicod70-CompatDecomp'
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicod70-Overlay'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode5'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode62'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode7'
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicode70-Arabic'
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicode70-Design'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode70-Hamza'
-- Possible downref: Non-RFC (?) normative reference: ref.
'Unicode70-Stability'
-- Obsolete informational reference (is this intentional?): RFC 3490
(Obsoleted by RFC 5890, RFC 5891)
Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 19 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group J. Klensin
3 Internet-Draft
4 Updates: 5892, 5894 (if approved) P. Faltstrom
5 Intended status: Standards Track Netnod
6 Expires: September 11, 2015 March 10, 2015
8 IDNA Update for Unicode 7.0.0
9 draft-klensin-idna-5892upd-unicode70-04.txt
11 Abstract
13 The current version of the IDNA specifications anticipated that each
14 new version of Unicode would be reviewed to verify that no changes
15 had been introduced that required adjustments to the set of rules
16 and, in particular, whether new exceptions or backward compatibility
17 adjustments were needed. The review for Unicode 7.0.0 first
18 identified a potentially problematic new code point and then a much
19 more general and difficult issue with Unicode normalization. This
20 specification discusses those issues and proposes updates to IDNA
21 and, potentially, the way the IETF handles comparison of identifiers
22 more generally, especially when there is no associated language or
23 language identification. It also applies an editorial clarification
24 to RFC 5892 that was the subject of an earlier erratum and updates
25 RFC 5894 to point to the issues involved.
27 Status of This Memo
29 This Internet-Draft is submitted in full conformance with the
30 provisions of BCP 78 and BCP 79.
32 Internet-Drafts are working documents of the Internet Engineering
33 Task Force (IETF). Note that other groups may also distribute
34 working documents as Internet-Drafts. The list of current Internet-
35 Drafts is at http://datatracker.ietf.org/drafts/current/.
37 Internet-Drafts are draft documents valid for a maximum of six months
38 and may be updated, replaced, or obsoleted by other documents at any
39 time. It is inappropriate to use Internet-Drafts as reference
40 material or to cite them other than as "work in progress."
42 This Internet-Draft will expire on September 11, 2015.
44 Copyright Notice
46 Copyright (c) 2015 IETF Trust and the persons identified as the
47 document authors. All rights reserved.
49 This document is subject to BCP 78 and the IETF Trust's Legal
50 Provisions Relating to IETF Documents
51 (http://trustee.ietf.org/license-info) in effect on the date of
52 publication of this document. Please review these documents
53 carefully, as they describe your rights and restrictions with respect
54 to this document. Code Components extracted from this document must
55 include Simplified BSD License text as described in Section 4.e of
56 the Trust Legal Provisions and are provided without warranty as
57 described in the Simplified BSD License.
59 Table of Contents
61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
62 2. Document Aspirations . . . . . . . . . . . . . . . . . . . . 6
63 3. Problem Description . . . . . . . . . . . . . . . . . . . . . 7
64 3.1. IDNA assumptions about Unicode normalization . . . . . . 7
65 3.2. The discovery and the Arabic script cases . . . . . . . . 9
66 3.2.1. New code point U+08A1, decomposition, and language
67 dependency . . . . . . . . . . . . . . . . . . . . . 9
68 3.2.2. Other examples of the same behavior within the Arabic
69 Script . . . . . . . . . . . . . . . . . . . . . . . 10
70 3.2.3. Hamza and Combining Sequences . . . . . . . . . . . . 10
71 3.3. Precomposed characters without decompositions more
72 generally . . . . . . . . . . . . . . . . . . . . . . . . 11
73 3.3.1. Description of the general problem . . . . . . . . . 11
74 3.3.2. Latin Examples and Cases . . . . . . . . . . . . . . 12
75 3.3.3. Examples and Cases from Other Scripts . . . . . . . . 14
76 3.3.4. Scripts with precomposed preferences and ones with
77 combining preferences . . . . . . . . . . . . . . . . 15
78 3.4. Confusion and the casual user . . . . . . . . . . . . . . 15
79 4. Implementation options and issues: Unicode properties,
80 exceptions, and the nature of stability . . . . . . . . . . . 15
81 4.1. Unicode Stability compared to IETF (and ICANN) Stability 15
82 4.2. New Unicode Properties . . . . . . . . . . . . . . . . . 17
83 4.3. The need for exception lists . . . . . . . . . . . . . . 18
84 5. Proposed/ Alternative Changes to RFC 5892 for the issues
85 first exposed by new code point U+08A1 . . . . . . . . . . . 18
86 5.1. Disallow This New Code Point . . . . . . . . . . . . . . 18
87 5.2. Disallow This New Code Point and All Future Precomposed
88 Additions that do not decompose . . . . . . . . . . . . . 19
89 5.3. Disallow the combining sequences for these characters . . 19
90 5.4. Disallow all Combining Characters for Specific Scripts . 21
91 5.5. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 21
92 5.6. Normalization Form IETF (NFI) . . . . . . . . . . . . . . 21
93 6. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 22
94 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 23
95 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23
96 9. Security Considerations . . . . . . . . . . . . . . . . . . . 23
97 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 24
98 10.1. Normative References . . . . . . . . . . . . . . . . . . 24
99 10.2. Informative References . . . . . . . . . . . . . . . . . 27
100 Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 28
101 A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 28
102 A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 28
103 A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 29
104 A.4. Changes from version -03 to -04 . . . . . . . . . . . . . 29
105 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 29
107 1. Introduction
109 Note in/about -04 Draft: This version of the document contains a
110 very large amount of new material as compared to the -03 version.
111 The new material reflects an evolution of community understanding
112 in the last two months from an assumption that the problem
113 involved only a few code points and one combining character in a
114 single script (Hamza Above and Arabic) to an understanding that it
115 is quite pervasive and may represent fundamental misunderstandings
116 or omissions from IDNA2008 (and, by extension, the basics of
117 PRECIS [PRECIS-Framework]) that must be corrected if those
118 protocols are going to be used in a way that supports the
119 predictability (as seen by the end user) and security of Internet
120 internationalized identifiers.
122 This version is still necessarily incomplete: not only is our
123 understanding probably still not comprehensive, but there are a
124 number of placeholders for text and references. Nonetheless, the
125 document in its current form should be useful as both the
126 beginning of a comprehensive overview of the issues and a source
127 of references to other relevant materials.
129 This draft could almost certainly be organized better to improve
130 its readability: specific suggestions would be welcome.
132 The current version of the IDNA specifications, known as "IDNA2008"
133 [RFC5890], anticipated that each new version of Unicode would be
134 reviewed to verify that no changes had been introduced that required
135 adjustments to IDNA's rules and, in particular, whether new
136 exceptions or backward compatibility adjustments were needed. When
137 that review was carefully conducted for Unicode 7.0.0 [Unicode7],
138 comparing it to prior versions including the text in Unicode 6.2
139 [Unicode62], it identified a problematic new code point (U+08A1,
140 ARABIC LETTER BEH WITH HAMZA ABOVE). The code point was added for
141 use with the Fula (also known as Fulfulde, Pulaar, and Pular'Fulaare)
142 language, a language that, apparently, is most often written in Latin
143 characters today [Omniglot-Fula] [Dalby] [Daniels].
145 The specific problem is discussed in detail in Section 3. In very
146 broad terms, IDNA (and other IETF work) assume that, if one can
147 represent "the same character" either as a combining sequence or as a
148 single code point, strings that are identical except for those
149 alternate forms will compare equal after normalization. Part of the
150 difficulty that has characterized this discussion is that "the same"
151 differs depending on the criteria that are chosen.
153 The behavior of the newly-added code point, while non-optimal for
154 IDNA, follows that of a few code points that predate Unicode 7.x and
155 even the IDNA 2008 specifications and Unicode 6.0. Those existing
156 code points, which may not be easy to accurately characterize as a
157 group, make it exceedingly problematic to decide what, if anything,
158 to do about this new one and, perhaps separately, what to do about
159 existing sets of code points with the same behavior, because
160 different reasonable criteria yield different decisions,
161 specifically:
163 o To disallow it (and future, but not existing, characters with
164 similar characteristics) as an IDNA exception case creates
165 inconsistencies with how those earlier code points were handled.
167 o To disallow it and the similar code points as well would
168 necessitate invalidating some potential labels that would have
169 been valid under IDNA2008 until this time. Depending on how the
170 collection of similar code points is characterized, a few of them
171 are almost certainly used in reasonable labels.
173 o To permit the new code point to be treated as PVALID creates a
174 situation in which it is possible, within the same script, to
175 compose the same character symbol (glyph) in two different ways
176 that do not compare equal even after normalization. That
177 condition would then apply to it and the earlier code points with
178 the same behavior. That situation contradicts a fundamental
179 assumption of IDNA that is discussed in more detail below.
181 NOTE IN DRAFT:
183 This working draft discusses six alternatives, including an idea
184 (an IETF-specific normalization form) that seemed too drastic to
185 be considered a few months ago. However, it not only would have
186 been appropriate to discuss when the IDNA2008 specifications were
187 being developed but is appearing more attractive now. The authors
188 suggest that the community discuss the relevant tradeoffs and make
189 a decision and that the document then be revised to reflect that
190 decision, with the other alternatives discussed as options not
191 chosen. Because there is no ideal choice, the discussion of the
192 issues in Section 3 is probably at least as important as the
193 particular choice of how to handle this code point. In addition
194 to providing information for this document, that section should be
195 considered as an updating addendum to RFC 5894 [RFC5894] and
196 should be incorporated into any future revision of that document.
198 As the result of this version of the document containing several
199 alternate proposals, some of the text is also a little bit
200 redundant. That will be corrected in future versions.
202 As anticipated when IDNA2008, and RFC 5892 in particular, were
203 written, exceptions and explicit updates are likely to be needed only
204 if there is disagreement between the Unicode Consortium's view about
205 what is best for the Standard and the IETF's view of what is best for
206 IDNs, the DNS, and IDNA. It was hoped that a situation would never
207 arise in which the two perspectives would disagree, but the
208 possibility was anticipated and considerable mechanism added to RFC
209 5890 and 5892 as a result. It is probably important to note that a
210 disagreement in this context does not imply that anyone is "wrong",
211 only that the two different groups have different needs and therefore
212 criteria about what is acceptable. For that reason, the IETF has, in
213 the past, allowed some characters for IDNA that active Unicode
214 Technical Committee members suggested be disallowed to avoid a change
215 in derived tables [RFC6452]. This document describes a case where
216 the IETF should disallow a character or characters that the various
217 properties would otherwise treat as PVALID.
219 This document provides the "flagging for the IESG" specified by
220 Section 5.1 of RFC 5892. As specified there, the change itself
221 requires IETF review because it alters the rules of Section 2 of that
222 document.
224 [[RFC Editor: please remove the following comment and note if they
225 get to you.]]
227 [[IESG: It might not be a bad idea to incorporate some version of
228 the following into the Last Call announcement.]]
230 NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
231 particularly the choices among options for either adding exception
232 cases to RFC 5892 or ignoring the issue, warning people, and
233 hoping the results do not include serious problems, are fairly
234 esoteric. Understanding them requires that one have at least some
235 understanding of how the Arabic Script (and perhaps other scripts
236 in which precomposed characters are preferred over combining
237 sequences as a Unicode design and extension principle) works and
238 the reasons the Unicode Standard gives various Arabic Script
239 characters a fairly extended discussion [Unicode70-Arabic]. It
240 also requires understanding of a number of Unicode principles,
241 including the Normalization Stability rules [UAX15-Versioning] as
242 applied to new precomposed characters and guidelines for adding
243 new characters. There is considerable discussion of the issues in
244 Section 3 and references are provided for those who want to pursue
245 them, but potential reviewers should assume that the background
246 needed to understand the reasons for this change is no less deep
247 in the subject matter than would be expected of someone reviewing
248 a proposed change in, e.g., the fundamentals of BGP, TCP
249 congestion control, or some cryptographic algorithm. Put more
250 bluntly, one's ability to read or speak languages other than
251 English, or even one or more languages that use the Arabic script
252 or other scripts similarly affected, does not make one an expert
253 in these matters.
255 This document assumes that the reader is reasonably familiar with the
256 terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
257 IETF conventions for representing Unicode code points [RFC5137].
258 Some terms used here may not be used in the same way in those two
259 sets of documents. From one point of view, those differences may
260 have been the results of, or led to, misunderstandings that may, in
261 turn, be part of the root cause of the problems explored in this
262 document. In particular, this document uses the term "precomposed
263 character" to describe characters that could reasonably be composed
264 by a combining sequence using code points in the same script but for which a
265 single code point that does not require combining sequences is
266 available. That definition is strictly about mechanical composition
267 and does not involve any considerations about how the character is
268 used. It is closely related to this document's definition of
269 "identical". When a precomposed character exists and either applying
270 NFC to the combining sequence does not yield that character or
271 applying NFD to that character's code point does not yield the
272 combining sequence, it is referred to in this document as "non-
273 decomposable".
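The definition of "non-decomposable" given above can be read as a mechanical test. The following is a minimal sketch, not part of the draft, that assumes Python's standard unicodedata module (whose tables track whatever Unicode version the interpreter was built with). The function name is_non_decomposable is hypothetical, and the Arabic example pair (U+0681 against U+062D U+0654) is chosen here for illustration only.

```python
import unicodedata


def is_non_decomposable(precomposed: str, sequence: str) -> bool:
    """Hypothetical helper: literal reading of the definition in the text.

    True when NFC of the combining sequence does not yield the precomposed
    character, or NFD of the precomposed character does not yield the
    combining sequence.
    """
    nfc_of_sequence = unicodedata.normalize("NFC", sequence)
    nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)
    return nfc_of_sequence != precomposed or nfd_of_precomposed != sequence


# Latin: U+00F6 and "o" + U+0308 are canonically equivalent.
latin_result = is_non_decomposable("\u00f6", "o\u0308")

# Arabic: U+0681 has no decomposition to HAH (U+062D) + U+0654.
arabic_result = is_non_decomposable("\u0681", "\u062d\u0654")

print(latin_result, arabic_result)
```

The test says nothing about how a character is used; like the definition it follows, it is strictly about mechanical composition.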
275 2. Document Aspirations
277 This document, in its present form, is not a proposal for a solution.
278 Instead, it is intended to be (or evolve into) a comprehensive
279 description of the issues and problems and to outline some possible
280 approaches to a solution. A perfect solution -- one that would
281 resolve all of the issues identified in this document, would involve
282 a relatively small set of relatively simple rules and hence would be
283 comprehensible and predictable for and by non-expert end users, would
284 not require code point by code point or even block by block exception
285 lists, and would not leave users of any script or language feeling
286 that their particular writing system has been treated less fairly
287 than others.
289 Part of the reality we need to accept is that IDNA, in its present
290 form, represents compromises that do not completely satisfy those
291 criteria and whatever is done about these issues will probably make
292 it (or the job of administering zones containing IDNs) more complex.
293 Similarly, as the Unicode Standard suggests when it identifies ten
294 Design Principles and the text then says "Not all of these principles
295 can be satisfied simultaneously..." [Unicode70-Design], while there
296 are guidelines and principles, a certain amount of subjective
297 judgment is involved in making determinations about normalization,
298 decomposition, and some property values. For Unicode itself, those
299 issues are resolved by multiple statements (at least one cited below)
300 that one needs to rely on per-code point information in the Unicode
301 Character Database rather than on rules or principles. The design of
302 IDNA and the effort to keep it largely independent of Unicode
303 versions requires rules, categories, and principles that can be
304 relied upon and applied algorithmically. There is obviously some
305 tension between the two approaches.
307 3. Problem Description
309 3.1. IDNA assumptions about Unicode normalization
311 IDNA makes several assumptions about Unicode, Unicode "characters",
312 and the effects of normalization. Those assumptions were based on
313 careful reading of the Unicode Standard at the time [Unicode5],
314 guided by advice and commitments by members of the Unicode Technical
315 Committee. Those assumptions, and the associated requirements, are
316 necessitated by three properties of DNS labels that typically do not
317 apply to blocks of running text:
319 1. There is no language context for a label. While particular DNS
320 zones may impose restrictions, including language or script
321 restrictions, on what labels can be registered, neither the DNS
322 nor IDNA impose either type of restriction or give the user of a
323 label any indication about the registration or other restrictions
324 that may have been imposed.
326 2. Labels are often mnemonics rather than words in any language.
327 They may be abbreviations or acronyms or contain embedded digits
328 and have other characteristics that are not typical of words.
330 3. Labels are, in practice, usually short. Even when they are the
331 maximum length allowed by the DNS and IDNA, they are typically
332 too short to provide significant context. Statements that
333 suggest that languages can almost always be determined from
334 relatively short paragraphs or equivalent bodies of text do not
335 apply to DNS labels because of their typical short length and
336 because, as noted above, they are not required to be formed
337 according to language-based rules.
339 At the same time, because the DNS is an exact-match system, there
340 must be no ambiguity about whether two labels are equal. Although
341 there have been extensive discussions about "confusingly similar"
342 characters, labels, and strings, such tests between scripts are
343 always somewhat subjective: they are affected by choices of type
344 styles and by what the user expects to see. In spite of the fact
345 that the glyphs that represent many characters in different scripts
346 are identical in appearance (e.g., basic Latin "a" (U+0061) and the
347 identical-appearing Cyrillic character (U+0430)), the most important
348 test is that, if two glyphs are the same within a given script, they
349 must represent the same character no matter how they are formed.
351 Unicode normalization, as explained in [UAX15], is expected to
352 resolve those "same script, same glyph, different formation methods"
353 issues. Within the Latin script, the code point sequence for lower
354 case "o" (U+006F) and combining diaeresis (U+0308) will, when
355 normalized using the "NFC" method required by IDNA, produce the
356 precomposed small letter o with diaeresis (U+00F6) and hence the two
357 ways of forming the character will compare equal (and the combining
358 sequence is effectively prohibited from U-labels).
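As an illustration of the Latin case just described, the sketch below (an example added for this discussion, assuming Python's standard unicodedata module) shows the combining sequence composing under NFC and shows why a string containing that sequence is not itself in NFC form. The helper name is_nfc is hypothetical.

```python
import unicodedata

combining_sequence = "o\u0308"   # U+006F followed by U+0308
precomposed = "\u00f6"           # U+00F6


def is_nfc(text: str) -> bool:
    """Hypothetical helper: True when text is unchanged by NFC."""
    return unicodedata.normalize("NFC", text) == text


# NFC maps the combining sequence onto the precomposed code point, so
# the two spellings compare equal after normalization.
composed = unicodedata.normalize("NFC", combining_sequence)
same_after_nfc = composed == precomposed

# A string that still contains the combining sequence is not in NFC
# form, which is why it cannot appear in a U-label.
sequence_is_nfc = is_nfc(combining_sequence)
precomposed_is_nfc = is_nfc(precomposed)

print(same_after_nfc, sequence_is_nfc, precomposed_is_nfc)
```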
360 NFC was preferred over other normalization methods for IDNA because
361 it is more compact, more likely to be produced on keyboards on which
362 the relevant characters actually appeared, and because it does not
363 lose substantive information (e.g., some types of compatibility
364 equivalence involve judgment calls as to whether two characters are
365 actually the same -- they may be "the same" in some contexts but not
366 others -- while canonical equivalence is about different ways to
367 produce the glyph for the same abstract character).
369 IDNA also assumed that the extensive Unicode stability rules would be
370 applied and work as specified when new code points were added. Those
371 rules, as described in The Unicode Standard and the normative annexes
372 identified below, provide that:
374 1. New code points representing precomposed characters that can be
375 formed from combining sequences will not be added to Unicode
376 unless neither the relevant base character nor required combining
377 character(s) are part of the Standard within the relevant script
378 [UAX15-Versioning].
380 2. If circumstances require that principle be violated,
381 normalization stability requires that the newly-added character
382 decompose (even under NFC) to the previously-available combining
383 sequence [UAX15-Exclusion].
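The second rule can be observed with a character from the Composition Exclusion Table. The sketch below is an added illustration, assuming Python's standard unicodedata module. It uses U+2ADC (FORKING), a code point chosen for this example and not mentioned elsewhere in this document, which has a canonical decomposition but is excluded from recomposition.

```python
import unicodedata

forking = "\u2adc"                 # U+2ADC FORKING (illustrative choice)
older_sequence = "\u2add\u0338"    # U+2ADD followed by U+0338

# The character carries a canonical decomposition mapping...
mapping = unicodedata.decomposition(forking)

# ...and, because it is excluded from composition, NFC does not put it
# back together: the normalized form is the combining sequence.
nfc_of_precomposed = unicodedata.normalize("NFC", forking)
nfc_of_sequence = unicodedata.normalize("NFC", older_sequence)

print(mapping)
print(nfc_of_precomposed == nfc_of_sequence == older_sequence)
```

Both spellings therefore normalize to the same string, which is the behavior IDNA assumed for any precomposed character added after its combining sequence was already available.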
385 At least at the time IDNA2008 was being developed, there was no
386 explicit provision in the Standard's discussion of conditions for
387 adding new code points, nor of normalization stability, for an
388 exception based on different languages using the same script or
389 ambiguities about the shape or positioning of combining characters.
391 3.2. The discovery and the Arabic script cases
393 While the set of problems with normalization discussed above was
394 discovered with a newly-added code point for the Arabic Script and
395 some characteristics of Unicode handling of that script seem to make
396 the problem more complex going forward, these are not issues specific
397 to Arabic. This section describes the Arabic-specific problems;
398 subsequent ones (starting with Section 3.3) discuss the problem more
399 generally and include illustrations from other scripts.
401 3.2.1. New code point U+08A1, decomposition, and language dependency
403 Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
404 WITH HAMZA ABOVE. As can be deduced from the name, it is visually
405 identical to the glyph that can be formed from a combining sequence
406 consisting of the code point for ARABIC LETTER BEH (U+0628) and the
407 code point for Combining Hamza Above (U+0654). The two rules
408 summarized above (see the last part of Section 3.1) suggest that
409 either the new code point should not be allocated at all or that it
410 should have a decomposition to \u'0628'\u'0654'.
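The actual behavior can be checked directly. The following sketch is an added illustration, assuming Python's standard unicodedata module built against Unicode 7.0.0 or later. It compares the new code point with the combining sequence under all four normalization forms.

```python
import unicodedata

precomposed = "\u08a1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence = "\u0628\u0654"     # ARABIC LETTER BEH + Combining Hamza Above

# U+08A1 has no decomposition mapping (empty string in the database).
mapping = unicodedata.decomposition(precomposed)

# Compare the two spellings under every normalization form.
equal_under = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, precomposed)
    b = unicodedata.normalize(form, sequence)
    equal_under[form] = (a == b)

print(repr(mapping), equal_under)
```

No normalization form makes the two spellings compare equal, so two labels that differ only in this respect would be distinct names in the DNS.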
412 Had the issues outlined in this document been better understood at
413 the time, it probably would have been wise for RFC 5892 to disallow
414 either the precomposed character or the combining sequence of each
415 pair in those cases in which Unicode normalization rules do not cause
416 the right thing to happen, i.e., the combining sequence and
417 precomposed character to be treated as equivalent. Failure to do so
418 at the time places an extra burden on registries to be sure that
419 conflicts (and the potential for confusion and attacks) do not exist.
420 Oddly, had the exclusion been made part of the specification at that
421 time, the preference for precomposed forms noted above would probably
422 have dictated excluding the combining sequence, something not
423 otherwise done in IDNA2008 because the NFC requirement serves the
424 same purpose. Today, the only thing that can be excluded without the
425 potential disruption of disallowing a previously-PVALID combining
426 sequence is the newly-added code point, so whatever is
427 done, or might have been contemplated with hindsight, will be
428 somewhat inconsistent.
430 3.2.2. Other examples of the same behavior within the Arabic Script
432 One of the things that complicates the issue with the new U+08A1 code
433 point is that there are several other Arabic-script code points that
434 behave in the same way for similar language-specific reasons.
436 In particular, at least three other grapheme clusters that have been
437 present for many versions of Unicode can be seen as involving issues
438 similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
439 ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
440 REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
441 preferred over combining sequences using HAMZA ABOVE (U+0654)
442 [Unicode70-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
443 (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH
444 HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC
445 LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654'
446 so the precomposed character and combining sequences compare equal
447 when both are normalized, as this specification prefers.
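The contrast between the two groups of code points named above can be seen in their decomposition mappings. This is a minimal sketch added for illustration, assuming Python's standard unicodedata module; the names CODE_POINTS and mappings are introduced here only for the example.

```python
import unicodedata

# Code points named in the paragraph above.
CODE_POINTS = [0x0681, 0x076C, 0x0623, 0x0624, 0x0626]

# Decomposition mapping for each code point; an empty string means the
# character has no decomposition.
mappings = {cp: unicodedata.decomposition(chr(cp)) for cp in CODE_POINTS}

for cp in CODE_POINTS:
    name = unicodedata.name(chr(cp))
    shown = mappings[cp] or "(no decomposition)"
    print(f"U+{cp:04X} {name}: {shown}")
```

The first two entries report no decomposition, while the last three report a base letter followed by 0654.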
449 There are other variations in which a precomposed character involving
450 HAMZA ABOVE has a decomposition to a combining sequence that can form
451 it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
452 compatibility decomposition, but not a canonical one, into the
453 combining sequence \u'06C7'\u'0674'.
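Because the decomposition of U+0677 is a compatibility mapping and not a canonical one, the NFC form required by IDNA leaves the character alone, while the compatibility forms replace it. The sketch below is an added illustration, assuming Python's standard unicodedata module.

```python
import unicodedata

u_with_hamza = "\u0677"       # ARABIC LETTER U WITH HAMZA ABOVE
sequence = "\u06c7\u0674"     # the sequence named in the text

# The mapping is tagged <compat>, marking it as a compatibility mapping.
mapping = unicodedata.decomposition(u_with_hamza)

# NFC applies canonical mappings only, so the character is unchanged.
nfc = unicodedata.normalize("NFC", u_with_hamza)

# NFKC applies compatibility mappings as well.
nfkc = unicodedata.normalize("NFKC", u_with_hamza)

print(mapping)
print(nfc == u_with_hamza, nfkc == sequence)
```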
455 3.2.3. Hamza and Combining Sequences
457 As the Unicode Standard points out at some length [Unicode70-Arabic],
458 Hamza is a problematic abstract character and the "Hamza Above"
459 construction even more so [Unicode70-Hamza]. Those sections explain
460 a distinction made by Unicode between the use of a Hamza mark to
461 denote a glottal stop and one used as a diacritic mark to denote a
462 separate letter. In the first case, the combining sequence is used.
463 In the second, a precomposed character is assigned.
465 Unlike Unicode generally and because of concerns about identifier
466 spoofing and attacks based on similarities, character distinctions in
467 IDNA are based much more strictly on the appearance of characters;
468 language and pronunciation distinctions within a script are not
469 considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-
470 tautologically the same as BEH WITH HAMZA ABOVE, even if one of them
471 is written as U+08A1 (new to Unicode 7.0.0) and the other as the
472 sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also
473 available in versions of Unicode going back at least to the version
474 [Unicode32] used in the original version of IDNA [RFC3490]). Because
475 the precomposed form and combining sequence are, for IDNA purposes,
476 the same, IDNA expects that normalization (specifically the
477 requirement that all U-labels be in NFC form) will cause them to
478 compare equal.
480 If Unicode also considered them the same, then the principle would
481 apply that new precomposed ("composition") forms are not added unless
482 one of the code points that could be used to construct it did not
483 exist in an earlier version (and even then is discouraged)
484 [UAX15-Versioning]. When exceptions are made, they are expected to
485 conform to the rules and classes in the "Composition Exclusion
486 Table", with class 2 being relevant to this case [UAX15-Exclusion].
487 That rule essentially requires that the normalization for the old
488 combining sequence to itself be retained (for stability) but that the
489 newly-added character be treated as canonically decomposable and
490 decompose back to the older sequence even under NFC. That was not
491 done for this particular case, presumably because of the distinction
492 about pronunciation modifiers versus separate letters noted above.
493 Because, for IDNA and the DNS, there is a possibility that the
494 composing sequence \u'0628'\u'0654' already appears in labels, the
495 only choice other than allowing an otherwise-identical, and
496 identically-appearing, label with U+08A1 substituted to identify a
497 different DNS entry is to DISALLOW the new character.
499 3.3. Precomposed characters without decompositions more generally
501 3.3.1. Description of the general problem
503 As mentioned above, IDNA made a strong assumption that, if there were
504 two ways to form the same abstract character in the same script,
505 normalization would result in them comparing equal. Work on IDNA2008
506 recognized that early versions of Unicode might also contain some
507 inconsistencies; see Section 3.3.2.4 below.
509 Having precomposed code points exist that don't have decompositions,
510 or having them allocated in the future, is problematic for those IDNA
511 assumptions about character comparison, and seems to call for either
512 excluding some set of code points that IDNA's rules do not now
513 identify, developing and using a normalization procedure that behaves
514 as expected (those two options may be nearly equivalent for many
515 purposes) or deciding to accept a risk that, apparently, will only
516 increase over time.
518 It is not clear whether the reasons the IDNABIS WG did not understand
519 and allow for these cases are important except insofar as they inform
520 considerations about what to do in the future. It seemed (and still
521 seems to some people) that the Unicode Standard is very clear on the
522 matter. In addition to the normalization stability rules cited in
523 the last part of Section 3.1, the discussion in the Core Standard
524 seems quite clear. For example, "Where characters are used in
525 different ways in different languages, the relevant properties are
526 normally defined outside the Unicode Standard" in Section 2.2,
527 subsection titled "Semantics" [Unicode7] did not suggest to most
528 readers that sometimes separate code points would be allocated within
529 a script based on language considerations. Similarly, the same
530 section of the Standard says, in a subsection titled "Unification",
531 "The Unicode Standard avoids duplicate encoding of characters by
532 unifying them within scripts across language" and does not list
533 exceptions to that rule or limit it to a single script although it
534 goes on to list "CJK" as an example. Another subsection, "Equivalent
535 Sequences" indicates "Common precomposed forms ... are included for
536 compatibility with current standards. For static precomposed forms,
537 the standard provides a mapping to an equivalent dynamically composed
538 sequence of characters". The latter appears to be precisely the "all
539 precomposed characters decompose into the relevant combining
540 sequences if the relevant base and combining characters exist in the
541 Standard" that IDNA needs and assumed and, again, there is no mention
542 of exceptions, language-dependent or otherwise. The summary of
543 stability policies cited in the Standard [Unicode70-Stability] does
544 not appear to shed any additional light on these issues.
546 The Standard now contains a subsection titled "Non-decomposition of
547 Overlaid Diacritics" [Unicod70-Overlay] that identifies a list of
548 diacritics that do not normally form characters that have
549 decompositions. The rule given has its own exceptions and the text
550 clearly states that there is actually no way to know whether a code
551 point has a decomposition other than consulting the Unicode Character
552 Database entry for that code point. The subsequent section notes
553 that this can be a security problem; while the issues with IDNA go
554 well beyond what is normally considered security, that comment now
555 seems clear. While that subsection is helpful in explaining the
556 problem, especially for European scripts, it does not appear in the
557 Unicode versions that were current when IDNA2008 was being developed.
559 3.3.2. Latin Examples and Cases
561 While this set of problems was discovered because of a code point
562 added to the Arabic script in precombined form to support a
563 particular language, there are actually far more examples for, e.g.,
564 Latin script than there are for Arabic script. Many of them are
565 associated with the "non-decomposition of combining diacriticals"
566 issues mentioned above, but the next subsections describe other cases
567 that are not directly bound to decomposition.
569 3.3.2.1. The font exclusion and compatibility relationships
571 Unicode contains a large collection of characters that are identified
572 as "Mathematical Symbols". A large subset of them are basic or
573 decorated Latin characters, differing from the ordinary ones only by
574 their usage and, in appearance, by font or type styling (despite the
575 general principle that font distinctions are not used as the basis
576 for assigning separate code points). Most of these have compatibility
577 mappings to the base form, which eliminates them from IDNA, but
578 others do not and, because the same marks that are used as phonetic
579 diacritical markings in conventional alphabetical use have special
580 mathematical meanings, applications that permit the use of these
581 characters have their own issues with normalization and equality.
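As a concrete illustration, the mapping for one such symbol can be inspected as follows. This is a minimal sketch, assuming Python 3 and its standard `unicodedata` module; the choice of U+1D400 as the sample is ours and the variable names are illustrative.

```python
import unicodedata

math_bold_a = "\U0001D400"   # MATHEMATICAL BOLD CAPITAL A

# The decomposition field is tagged <font>, marking a font-variant mapping.
decomp = unicodedata.decomposition(math_bold_a)

# NFC leaves the symbol alone; NFKC folds it to the ordinary letter.
nfc = unicodedata.normalize("NFC", math_bold_a)
nfkc = unicodedata.normalize("NFKC", math_bold_a)

print(decomp, nfc == math_bold_a, nfkc)
```

A symbol in the same family that lacks any such mapping would pass through both forms unchanged, which is the case of concern here.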
583 3.3.2.2. The phonetic notation characters and extensions
585 Another example involves various Phonetic Alphabet and Extension
586 characters, many of which, unlike the Mathematical ones, do not have
587 normalizations that would make them compare equal to the basic
588 characters with essentially identical representations. This would
589 not be a problem for IDNA if they were identified with a specialized
590 script or as symbols rather than letters, but neither is the case:
591 they are generally identified as lower case Latin Script letters even
592 when they are visually upper-case, another issue for IDNA.
594 3.3.2.3. Combining dots and other shapes combine... unless...
596 The discussion of "Non-decomposition of Overlaid Diacritics"
597 [Unicod70-Overlay] indirectly exhibits at least one reason why it has
598 been difficult to characterize the problem. If one combines that
599 subsection with others, one gets a set of rules that might be
600 described as:
602 1. If the precomposed character and the code points that make up the
603 combining sequence exist, then canonical composition and
604 decomposition work as expected, except...
606 2. If the precomposed character was added to Unicode after the code
607 points that make up the combining sequence, normalization
608 stability for the combining sequences requires that NFC applied
609 to the precomposed character decomposes rather than having the
610 combining sequence compose to the new character, however...
612 3. If the combining sequence involves a diacritic or other mark that
613 actually touches the base character when composed, the
614 precomposed character does not have a decomposition, unless...
616 4. The combining diacritic involved is Cedilla (U+0327), Ogonek
617 (U+0328), or Horn (U+031B), in which case the precomposed
618 characters that contain them "regularly" (but presumably not
619 always) do have decompositions, and...
621 5. There are further exceptions for Hamza (which does not overlay
622 the associated base character in the same way the Latin-derived
623 combining diacritics and other marks do). Those decisions to
624 decompose a precomposed character (or not) are based on language
625 or phonetic considerations, not the combining mechanism or
626 appearance, or perhaps,...
628 6. Some characters have compatibility decompositions rather than
629 canonical ones [Unicod70-CompatDecomp]. Because compatibility
630 relationships are treated differently by IDNA, PRECIS
631 [PRECIS-Framework], and, potentially, other protocols involving
632 identifiers for Internet use, the existence of a compatibility
633 relationship may or may not be helpful. Finally,...
635 7. There is no reason to believe the above list is complete. In
636 particular, if whether a precomposed character decomposes or not
637 is determined by language or phonetic distinctions, one may need
638 additional rules on a per-script and/or per-character basis.
640 The above list only covers the cases involving combining sequences.
641 It does not cover cases such as those in Section 3.3.2.1 and
642 Section 3.3.2.2 and there may be additional groups of cases not yet
643 identified.
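Several of the rules listed above can be observed with a few sample code points. The following is a minimal sketch, assuming Python 3 and its standard `unicodedata` module; the helper name `canonical_decomposition` and the choice of samples are ours, and the samples are illustrations rather than a complete test of any rule.

```python
import unicodedata

def canonical_decomposition(ch):
    """Return the canonical decomposition field, or '' when there is
    none or when the mapping is compatibility-only (tagged <...>)."""
    field = unicodedata.decomposition(ch)
    return "" if field.startswith("<") else field

samples = {
    "forking": "\u2ADC",      # rule 2: added after its combining sequence
    "o_stroke": "\u00F8",     # rule 3: overlaid stroke, no decomposition
    "c_cedilla": "\u00E7",    # rule 4: cedilla, decomposes
    "fi_ligature": "\uFB01",  # rule 6: compatibility mapping only
}
results = {name: canonical_decomposition(ch) for name, ch in samples.items()}

# Under rule 2, NFC decomposes the newer precomposed character.
forking_nfc = unicodedata.normalize("NFC", samples["forking"])

print(results, [hex(ord(c)) for c in forking_nfc])
```

As the Standard itself notes, the only dependable test for any given code point is this kind of lookup in the Unicode Character Database.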
645 3.3.2.4. "Legacy" characters and new additions
647 The development of categories and rules for IDNA recognized that
648 early versions of Unicode might contain some inconsistencies if
649 evaluated using more contemporary rules about code point assignments
650 and stability. In particular, there might be some exceptions from
651 different practices in early versions of Unicode or anomalies caused
652 by copying existing single- or dual-script standards into Unicode as
653 block rather than individual character additions to the repertoire.
654 The possibility of such "legacy" exceptions was one reason why the
655 IDNA category rules include explicit provisions for exception lists
656 (even though no such code points were identified prior to 2014).
658 3.3.3. Examples and Cases from Other Scripts
660 Research into these issues has not yet turned up a comprehensive list
661 of affected scripts and code points. As discussed elsewhere in this
662 document, it is clear that Arabic and Latin Scripts are significantly
663 affected, that some Han and Kangxi radicals and ideographs are
664 affected, and that other examples do exist -- it is just not known
665 how many of those examples there are and what patterns, if any,
666 characterize them.
668 3.3.4. Scripts with precomposed preferences and ones with combining
669 preferences
671 While the authors have been unable to find an explanation for the
672 differentiation in the Unicode Standard, we have been told that there
673 are differences among scripts as to whether the action preference is
674 to add new combining sequences only (and resist adding precomposed
675 characters) as suggested in Section 3.3.2.3 or to add precomposed
676 characters, often ones that do not have decompositions. If those
677 differences in preference do exist, it is probably important to have
678 them documented so that they can be reflected in IDNA review
679 procedures and elsewhere. It will also require IETF discussion of
680 whether combining sequences should be deprecated when the
681 corresponding precomposed characters are added or to disallow
682 combining sequences entirely for those scripts (as has been
683 implicitly suggested for Arabic language use [RFC5564]).
685 [[CREF1: The above isn't quite right and probably needs additional
686 discussion and text.]]
688 3.4. Confusion and the casual user
690 To the extent to which predictability for relatively casual users is
691 a desired and important feature of relevant applications or
692 application support protocols, it is probably worth observing that
693 the complex of rules and cases above is almost certainly too involved
694 for the typical such user to develop a good intuitive understanding
695 of how things behave and what relationships exist.
697 4. Implementation options and issues: Unicode properties, exceptions,
698 and the nature of stability
700 4.1. Unicode Stability compared to IETF (and ICANN) Stability
702 The various stability rules in Unicode [Unicode70-Stability] all
703 appear to be based on the model that once a value is assigned, it can
704 never be changed. That is probably appropriate for a character
705 coding system with multiple uses and applications. It is probably
706 the only option when normative relationships are expressed in tables
707 of values rather than by rules. One consequence of such a model is
708 that it is difficult or impossible to fix mistakes (for some
709 stability rules, the Unicode Standard does provide for exceptions)
710 and even harder to make adjustments that would normally be dictated
711 by evolution.
713 "No changes" provides a very strong and predictable type of stability
714 and there are many reasons to take that path. As in some of the
715 cases that motivated this document, the difficulty is that simply
716 adding new code points (in Unicode) or features (in a protocol or
717 application) may be destabilizing. One then has complete stability
718 for systems that never use or allow the new code points or features,
719 but rough edges for newer systems that encounter the
720 discrepancies. IDNA2003 (inadvertently) took that approach by freezing
721 on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever
722 been allowed, we would have had complete stability even as Unicode
723 libraries changed. Unicode has been quite ingenious about working
724 around those difficulties with such provisions as having code points
725 for newly-added precomposed characters decompose rather than altering
726 the normalization for the combining sequences. Other cases, such as
727 newly-added precomposed characters that do not decompose for, e.g.,
728 language or phonetic reasons, are more problematic.
730 The IETF (and ICANN and standards development bodies such as ISO and
731 ISO/IEC JTC1) have generally adopted a different type of stability
732 model, one which considers experience in use and the ill effects of
733 not making changes as well as the disruptive effects of doing so. In
734 the IETF model, if an earlier decision is causing sufficient harm and
735 there is consensus in the communities that are most affected that a
736 change is desirable enough to make transition costs acceptable, then
737 the change is made.
739 The difference and its implications are perhaps best illustrated by a
740 disagreement when IDNA2008 was being approved. IDNA2003 had
741 effectively prevented some characters, notably (measured by intensity
742 of the protests) the Sharp S character (U+00DF) from being used in
743 DNS labels by mapping them to other characters before conversion to
744 ACE form. It had also prohibited some other code points, notably ZWJ
745 (U+200D) and ZWNJ (U+200C), by discarding them. In both cases, there
746 were strong voices from the relevant language communities, supported
747 by the registry communities, that the characters were important
748 enough that it was more desirable to undergo the short-term pain of a
749 transition and some uncertainty than to continue to exclude those
750 characters and the IDNA2008 rules and repertoire are consistent with
751 that preference. The Unicode Consortium apparently believed that
752 stability --elimination of any possibility of label invalidation or
753 different interpretations of the same string-- was more important
754 than those writing system requirements and community preferences.
755 That view was expressed through what was effectively a fork in (or
756 attempt to nullify) the IETF Standard [UTS46], a result that has
757 probably been worse for the overall Internet than either of the
758 possible decision choices.
760 4.2. New Unicode Properties
762 One suggestion about the way out of these problems would be to create
763 one or more new Unicode properties, maintained along with the rest of
764 Unicode, and then incorporated into new or modified rules or
765 categories in IDNA. Given the analysis in this document, it appears
766 that that property (or properties) would need to provide:
768 1. Identification of combining characters that, when used in
769 combining sequences, do not produce decomposable characters.
770 [[CREF2: Wording on the above is not quite right but, for the
771 present, maybe the intent is clear.]]
773 2. Identification of precomposed characters that might reasonably be
774 expected to decompose, but that do not.
776 3. Identification of character forms that are distinct only because
777 of language or phonetic distinctions within a script.
779 4. Identification of scripts for which precomposed forms are
780 strongly preferred and combining sequences should either be
781 viewed as temporary mechanisms until precomposed characters are
782 assigned or banned entirely.
784 5. Identification of code points that represent symbols for
785 specific, non-language, purposes even if identified as letters or
786 numerals by their General Category (see Section 3.3.2.2 and
787 Section 3.3.2.1).
789 Some of these properties (or characteristics or values of a single
790 property) would be suitable for disallowing characters, code points,
791 or contextual sequences that otherwise might be allowed by IDNA.
792 Others would be more suitable for making equality comparisons come
793 out as needed by IDNA, particularly to eliminate distinctions based
794 on language context.
796 While it would appear that appropriate rules and categories could be
797 developed for IDNA (and, presumably, for PRECIS, etc.) if the problem
798 areas are those identified in this document, it is not yet known
799 whether the list is complete (and, hence, whether additional
800 properties or information would be needed).
802 Even with such properties, IDNA would still almost certainly need
803 exception lists. In addition, it is likely that stability rules for
804 those properties would need to reflect IETF norms with arrangements
805 for bringing the IETF and other communities into the discussion when
806 tradeoffs are reviewed.
808 4.3. The need for exception lists
810 [[CREF3: Note in draft: this section is a partial placeholder and may
811 need more elaboration.]]
812 Issues with exception lists and the requirements for them are
813 discussed in Section 2 above and RFC 5894 [RFC5894].
815 5. Proposed/Alternative Changes to RFC 5892 for the issues first
816 exposed by new code point U+08A1
818 NOTE IN DRAFT: See the comments in the Introduction, Section 1 and
819 the first paragraph of each Subsection below for the status of the
820 Subsections that follow. Each one, in combination with the material
821 in Section 3 above, also provides information about the reasons why
822 that particular strategy might or might not be appropriate.
824 5.1. Disallow This New Code Point
826 This option is almost certainly too Arabic-specific and does not
827 solve, or even address, the underlying problem. It also does not
828 inherently generalize to non-decomposing precomposed code points that
829 might be added in the future (whether to Arabic or other scripts)
830 even though one could add more code points to Category F in the same
831 way.
833 If chosen by the community, this subsection would update the portion
834 of the IDNA2008 specification that identifies rules for what
835 characters are permitted [RFC5892] to disallow that code point.
837 With the publication of this document, Section 2.6 ("Exceptions (F)")
838 of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
839 Category F so that the rule itself reads:
841 F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
842 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
843 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
844 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
845 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
846 303B, 30FB}
848 and then add to the subtable designated
849 "DISALLOWED -- Would otherwise have been PVALID"
850 after the line that begins "07FA", the additional line:
852 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE
854 This has the effect of making the cited code point DISALLOWED
855 independent of application of the rest of the IDNA rule set to the
856 current version of Unicode. Those wishing to create domain name
857 labels containing Beh with Hamza Above may continue to use the
858 sequence
860 U+0628, ARABIC LETTER BEH
861 followed by
863 U+0654, ARABIC HAMZA ABOVE
865 which was valid for IDNA purposes in Unicode 5.0 and earlier and
866 which continues to be valid.
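The effect of this option on label checking can be sketched as follows. This is a minimal illustration, assuming Python 3; `CATEGORY_F` transcribes the rule text above, while `PROPOSED_DISALLOWED` and `uses_proposed_disallowed` are hypothetical names introduced here, and the sketch does not implement the rest of the IDNA rule set.

```python
# Category F membership as it would read with 08A1 added (from the rule above).
CATEGORY_F = {
    0x00B7, 0x00DF, 0x0375, 0x03C2, 0x05F3, 0x05F4, 0x0640, 0x0660,
    0x0661, 0x0662, 0x0663, 0x0664, 0x0665, 0x0666, 0x0667, 0x0668,
    0x0669, 0x06F0, 0x06F1, 0x06F2, 0x06F3, 0x06F4, 0x06F5, 0x06F6,
    0x06F7, 0x06F8, 0x06F9, 0x06FD, 0x06FE, 0x07FA, 0x08A1, 0x0F0B,
    0x3007, 0x302E, 0x302F, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
    0x303B, 0x30FB,
}

# Hypothetical table holding only the subtable line this option would add.
PROPOSED_DISALLOWED = {0x08A1: "ARABIC LETTER BEH WITH HAMZA ABOVE"}

def uses_proposed_disallowed(label):
    """True if the label contains a code point this option would disallow."""
    return any(ord(ch) in PROPOSED_DISALLOWED for ch in label)

print(uses_proposed_disallowed("\u08A1"),         # precomposed form
      uses_proposed_disallowed("\u0628\u0654"))   # combining sequence
```

Only the precomposed form is rejected; the combining sequence is unaffected by this check.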
868 In principle, much the same thing could be accomplished by using the
869 IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892
870 Section 5.3). However, that category is described as applying only
871 when "property values in versions of Unicode after 5.2 have changed
872 in such a way that the derived property value would no longer be
873 PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in
874 Unicode 7.0.0 and no property values of code points in prior versions
875 have changed, category G does not apply. If that section of RFC 5892
876 were to be replaced in the future, perhaps consideration should be
877 given to adding Normalization Stability and other issues to that
878 description but, at present, it is not relevant.
880 5.2. Disallow This New Code Point and All Future Precomposed Additions
881 that do not decompose
883 At least in principle, the approach suggested above (Section 5.1)
884 could be expanded to disallow all future allocations of non-
885 decomposing precomposed characters. This would probably require
886 either a new Unicode property to identify such characters and/or more
887 emphasis on the manual, individual code point, checking of the new
888 Unicode version review process (i.e., not just application of the
889 existing rules and algorithm). It might require either a new rule in
890 IDNA or a modification to the structure of Category F to make
891 additions less tedious. It would do nothing for different ways to
892 form identical characters within the same script that were not
893 associated with decomposition and so would have to be used in
894 conjunction with other approaches. Finally, for scripts (such as
895 Arabic) where there is a very strong preference to avoid combining
896 sequences, this approach would exclude exactly the wrong set of
897 characters.
899 5.3. Disallow the combining sequences for these characters
901 As in the approach discussed in Section 5.1, this approach is too
902 Arabic-specific to address the more general problem. However, it
903 illustrates a single-script approach and a possible mechanism for
904 excluding combining sequences whose handling is connected to language
905 information (information that, as discussed above, is not relevant to
906 the DNS).
908 If chosen by the community, this subsection would update the portion
909 of the IDNA2008 specification that identifies contextual rules
910 [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction
911 with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that
912 the choice of this option is consistent with the general preference
913 for precomposed characters discussed above but would ban some labels
914 that are valid today and that might, in principle, be in use.
916 The required prohibition could be imposed by creating a new
917 contextual rule in RFC 5892 to constrain combining sequences
918 containing Hamza Above.
920 As the Unicode Standard points out at some length [Unicode70-Arabic],
921 Hamza is a problematic abstract character and the "Hamza Above"
922 construction even more so. IDNA has historically associated
923 characters whose use is reasonable in some contexts but not others
924 with the special derived property "CONTEXTO" and then specified
925 specific, context-dependent, rules about where they may be used.
926 Because Hamza Above is problematic (and spawns edge cases, as
927 discussed in the Unicode Standard section cited above), it was
928 suggested that a contextual rule might be appropriate. There are at
929 least two reasons why a contextual rule would not be suitable for the
930 present situation.
932 1. As discussed above, the present situation is a normalization
933 stability and predictability problem, not a contextual one. Had
934 the same issues arisen with a newly-added precomposed character
935 that could previously be constructed from non-problematic base
936 and combining characters, it would be even more clearly a
937 normalization issue and, following the principles discussed there
938 and particularly in UAX 15 [UAX15-Exclusion], might not have been
939 assigned at all.
941 2. The contextual rule sets are designed around restricting the use
942 of code points to a particular script or adjacent to particular
943 characters within that script. Neither of these cases applies to
944 the newly-added character even if one could imagine rules for the
945 use of Hamza Above (U+0654) that would reflect the considerations
946 of Chapter 8 of Unicode 6.2. Even had the latter been desired,
947 it would be somewhat late now -- Hamza Above has been present as
948 a combining character (U+0654) in many versions of Unicode.
949 While that section of the Unicode Standard describes the issues,
950 it does not provide actionable guidance about what to do about it
951 for cases going forward or when visual identity is important.
953 5.4. Disallow all Combining Characters for Specific Scripts
955 [[CREF4: This subsection needs to be turned into prose, but the
956 following bullet points are probably sufficient to identify the
957 issues.]]
959 Might work for Arabic and other "precomposed preference" scripts (see
960 Section 3.3.4; recommended by the Arabic language community for IDNs
961 [RFC5564]). Hopeless for Latin. Backwards incompatible. No effect
962 at all on special-use representations of identical characters within
963 a script (see Section 3.3.2.1 and Section 3.3.2.2).
965 5.5. Do Nothing Other Than Warn
967 The recommendation from UTC is to simply warn registries, at all
968 levels of the tree, to be careful with this set of characters, making
969 language distinctions within zones. Because the DNS cannot make or
970 enforce language distinctions, this suggestion is problematic but it
971 would avoid having the IETF either invalidating label strings that
972 are potentially now in use or creating inconsistencies among the
973 characters that combine with Hamza Above but that also have
974 precomposed forms that do not have decompositions. The potential
975 would still exist for registries to respect the warning and deprecate
976 such labels if they existed.
978 5.6. Normalization Form IETF (NFI)
980 The most radical possibility for the comparison issue would be to
981 decide that none of the Unicode Normalization Forms specified in UAX
982 15 [UAX15] are adequate for use with the DNS because, contrary to
983 their apparent descriptions, normalization tables are actually
984 determined using language information. However, use of language
985 information is unacceptable for IDNA for reasons described elsewhere
986 in this document. The remedy would be to define an IETF-specific (or
987 DNS-specific) normalization form (sometimes called "NFI" in
988 discussions), building on NFC but adhering strictly to the rule that
989 normalization causes two different forms of the same character (glyph
990 image) within the same script to be treated as equal. In practice
991 such a form could be implemented for IDNA purposes as an additional
992 rule within RFC 5892 (and its successors) that constituted an
993 exception list for the NFC tables. For this set of characters, the
994 special IETF normalization form would be equivalent to the exclusion
995 discussed in Section 5.3 above.
997 An Internet-specific normalization form, especially if specified
998 somewhat separately from the IDNA core, would have a small marginal
999 advantage over the other strategies in this section (or in
1000 combination with some of them), even though most of the end result
1001 and much of the implementation would be the same in practice. While
1002 the design of IDNA requires that strings be normalized as part of the
1003 process of determining label validity (and hence before either
1004 storage of values in the DNS or name resolution), there is an ongoing
1005 debate about whether normalization should be performed before storing
1006 a string or putting it on the wire or only when the string is
1007 actually compared or otherwise used.
1009 If a normalization procedure with the right properties for the IETF
1010 were defined, that argument could be bypassed and the best decisions
1011 made for different circumstances. The separation would also allow
1012 better comparison of strings that lack language context in
1013 applications environments in which the additional processing and
1014 character classifications of IDNA and/or PRECIS were not applicable.
1015 Having such a normalization procedure defined outside IDNA would also
1016 minimize changes to IDNA itself, which is probably an advantage.
1018 If the new normalization form were, in practice, simply an overlay on
1019 NFC with modifications dictated by exception and/or property lists,
1020 keeping its definition separate from IDNA would also avoid
1021 interweaving those exceptions and property lists with the rules and
1022 categories of IDNA itself, avoiding some unnecessary complexity.
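An overlay of this kind can be sketched in a few lines. The following is an illustration only, assuming Python 3 and its standard `unicodedata` module; the names `NFI_EXCEPTIONS` and `nfi` are hypothetical, the table holds a single entry, and the direction of the mapping (precomposed form to the older sequence) is an assumption of this sketch rather than a decision recorded in this document.

```python
import unicodedata

# Hypothetical exception list layered on NFC. Mapping the precomposed
# form to the older combining sequence is an assumption of this sketch.
NFI_EXCEPTIONS = {"\u08A1": "\u0628\u0654"}

def nfi(text):
    """Apply NFC, then the exception overlay, then NFC again."""
    nfc = unicodedata.normalize("NFC", text)
    mapped = "".join(NFI_EXCEPTIONS.get(ch, ch) for ch in nfc)
    # Re-apply NFC in case a substitution produced a composable sequence.
    return unicodedata.normalize("NFC", mapped)

print(nfi("\u08A1") == nfi("\u0628\u0654"))
```

With the overlay in place the two forms compare equal; text that contains no excepted code point is handled exactly as under NFC.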
1024 6. Editorial clarification to RFC 5892
1026 Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
1027 clarification to Appendix A and Section A.1 of RFC 5892. This
1028 section of this document updates the RFC to apply that clarification.
1030 1. In Appendix A, add a new paragraph after the paragraph that
1031 begins "The code point...". The new paragraph should read:
1033 "For the rule to be evaluated to True for the label, it MUST be
1034 evaluated separately for every occurrence of the Code point in
1035 the label; each of those evaluations must result in True."
1037 2. In Appendix A, Section A.1, replace the "Rule Set" by
1039 Rule Set:
1040 False;
1041 If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
1042 If cp .eq. \u200C And
1043 RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
1044 (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
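The per-occurrence evaluation described in item 1, applied to the rule set in item 2, can be sketched as follows. This is a minimal illustration, assuming Python 3; Python's standard library does not expose Joining_Type, so the caller supplies a lookup function, and `DEMO_JOINING_TYPES` is a hypothetical, deliberately tiny table. The function names are ours.

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named Virama

def zwnj_rule_at(label, i, joining_type):
    """Evaluate the rule set for the ZWNJ at index i of label."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # (Joining_Type:{L,D})(Joining_Type:T)* immediately before cp
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # (Joining_Type:T)*(Joining_Type:{R,D}) immediately after cp
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")

def zwnj_rule(label, joining_type):
    """True only if every ZWNJ occurrence in label satisfies the rule."""
    return all(zwnj_rule_at(label, i, joining_type)
               for i, ch in enumerate(label) if ch == ZWNJ)

# Hypothetical demonstration table: BEH is D, ALEF is R, HAMZA ABOVE is T.
DEMO_JOINING_TYPES = {"\u0628": "D", "\u0627": "R", "\u0654": "T"}

def demo_joining_type(ch):
    return DEMO_JOINING_TYPES.get(ch, "U")

print(zwnj_rule("\u0628\u200C\u0628", demo_joining_type))
```

A label with no ZWNJ yields True here only because there is nothing to evaluate; a real implementation would consult the rule only for labels that contain the code point.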
1046 7. Acknowledgements
1048 The Unicode 7.0.0 changes were extensively discussed within the IAB's
1049 Internationalization Program. The authors are grateful for the
1050 discussions and feedback there, especially from Andrew Sullivan and
1051 David Thaler. Additional information was requested and received from
1052 Mark Davis and Ken Whistler and while they probably do not agree with
1053 the necessity of excluding this code point or taking even more
1054 drastic action as their responsibility is to look at the Unicode
1055 Consortium requirements for stability, the decision would not have
1056 been possible without their input. Thanks to Bill McQuillan and Ted
1057 Hardie for reading versions of the document carefully enough to
1058 identify and report some confusing typographical errors. Several
1059 experts and reviewers who prefer to remain anonymous also provided
1060 helpful input and comments on preliminary versions of this document.
1062 8. IANA Considerations
1064 When the IANA registry and tables are updated to reflect Unicode
1065 7.0.0, changes should be made according to the decisions the IETF
1066 makes about Section 5.
1068 9. Security Considerations
1070 From at least one point of view, this document is entirely a
1071 discussion of a security issue or set of such issues. The
1072 "similar-looking characters" issue has been a concern since the
1073 earliest days of IDNs [HomographAttack] and has driven assorted
1074 "character confusion" projects [ICANN-VIP]. Beyond that, if a user
1075 types in a string on one device and can get different results that
1076 do not compare equal when it is typed on a different device (with
1077 both behaving correctly and both keyboards appearing to be the same
1078 and for the same script) then all security mechanisms that depend on
1079 the underlying identifiers, including the practical applications of
1080 DNS response integrity checks (DNSSEC [RFC4033]) and DNS-embedded public
1081 key mechanisms [RFC6698], are at risk if different parties, at least
1082 one of them malicious, obtain some of the identical-appearing and
1083 identically-typed strings.
1085 Mechanisms that depend on trusting registration systems (e.g.,
1086 registries and registrars in the DNS IDN case, see Section 5.5 above)
1087 are likely to be of only limited utility because fully-qualified
1088 domains that may be perfectly reasonable at the first level or two of
1089 the DNS may have differences of this type deep in the tree, into
1090 levels where name management is weak. Similar issues obviously apply
1091 when names are user-selected or unmanaged.
1093 When the issue is not a deliberate attack but simple accidental
1094 confusion among similar strings, most of our strategies depend on the
1095 acceptability of false negatives on matching if there is low risk of
1096 false positives (see, for example, the discussion of false negatives
1097 in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]).
1098 Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and
1099 the PRECIS effort [PRECIS-Framework]. But, because the cases covered
1100 here are connected, not just to what the user sees but to what is
1101 typed and where, there is an increased risk of false positives
1102 (accidental as well as deliberate).
1104 [[CREF5: Note in Draft: The paragraph that follows was written for a
1105 much earlier version of this document. It is obsolete, but is being
1106 retained as a placeholder for future developments.]]
1107 This specification excludes a code point for which the Unicode-
1108 specified normalization behavior could result in two ways to form a
1109 visually-identical character within the same script not comparing
1110 equal. That behavior could create a dream case for someone intending
1111 to confuse the user by use of a domain name that looked identical to
1112 another one, was entirely in the same script, but was still
1113 considered different.
1115 Internet security in areas that involve internationalized identifiers
1116 that might contain the relevant characters is therefore significantly
1117 dependent on some effective resolution for the issues identified in
1118 this document, not just hand waving, devout wishes, or appointment of
1119 study committees about it.
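
The normalization behavior described above can be illustrated with a
short sketch. It assumes the code point at issue is U+08A1 (ARABIC
LETTER BEH WITH HAMZA ABOVE) and that the visually identical
alternative is the sequence U+0628 U+0654; the helper name nfc_equal
is hypothetical and is not part of IDNA or of any library. The result
depends on the Unicode data bundled with the Python interpreter in
use.

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE (added in Unicode 7.0.0).
BEH_PRECOMPOSED = "\u08A1"
# U+0628 ARABIC LETTER BEH followed by U+0654 ARABIC HAMZA ABOVE.
BEH_SEQUENCE = "\u0628\u0654"

# For contrast: U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE has a
# canonical decomposition to U+0627 U+0654.
ALEF_PRECOMPOSED = "\u0623"
ALEF_SEQUENCE = "\u0627\u0654"


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC."""
    return (unicodedata.normalize("NFC", a)
            == unicodedata.normalize("NFC", b))


# U+08A1 has no canonical decomposition, so NFC does not unify the
# two BEH forms; the ALEF forms are unified.
beh_matches = nfc_equal(BEH_PRECOMPOSED, BEH_SEQUENCE)
alef_matches = nfc_equal(ALEF_PRECOMPOSED, ALEF_SEQUENCE)
print("BEH forms equal after NFC: ", beh_matches)
print("ALEF forms equal after NFC:", alef_matches)
```

Under these assumptions the two BEH forms remain distinct after NFC
while the two ALEF forms compare equal, which is the asymmetry that
allows two same-script labels to look identical yet be treated as
different.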
1121 10. References
1123 10.1. Normative References
1125 [PRECIS-Framework]
1126 Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
1127 Preparation, Enforcement, and Comparison of
1128 Internationalized Strings in Application Protocols",
1129 February 2015.
1132 [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP
1133 137, RFC 5137, February 2008.
1135 [RFC5890] Klensin, J., "Internationalized Domain Names for
1136 Applications (IDNA): Definitions and Document Framework",
1137 RFC 5890, August 2010.
1139 [RFC5892] Faltstrom, P., "The Unicode Code Points and
1140 Internationalized Domain Names for Applications (IDNA)",
1141 RFC 5892, August 2010.
1143 [RFC5892Erratum]
1144 "RFC5892, "The Unicode Code Points and Internationalized
1145 Domain Names for Applications (IDNA)", August 2010, Errata
1146 ID: 3312", Errata ID 3312, August 2012.
1149 [RFC5894] Klensin, J., "Internationalized Domain Names for
1150 Applications (IDNA): Background, Explanation, and
1151 Rationale", RFC 5894, August 2010.
1153 [RFC6943] Thaler, D., "Issues in Identifier Comparison for Security
1154 Purposes", RFC 6943, May 2013.
1156 [UAX15] Davis, M., Ed., "Unicode Standard Annex #15: Unicode
1157 Normalization Forms", June 2014.
1160 [UAX15-Exclusion]
1161 "Unicode Standard Annex #15, op. cit., Section 5".
1165 [UAX15-Versioning]
1166 "Unicode Standard Annex #15, op. cit., Section 3".
1169 [UTS46] Davis, M. and M. Suignard, "Unicode Technical Standard
1170 #46: Unicode IDNA Compatibility Processing", Version
1171 7.0.0, June 2014.
1173 [Unicod70-CompatDecomp]
1174 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1175 2.3: Compatibility Characters", Chapter 2, 2014.
1178 Subsection titled "Compatibility Decomposable Characters"
1179 starting on page 26.
1181 [Unicod70-Overlay]
1182 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1183 2.2: Unicode Design Principles", Chapter 2, 2014.
1186 Subsection titled "Non-decomposition of Overlaid
1187 Diacritics" starting on page 64.
1189 [Unicode5]
1190 The Unicode Consortium, "The Unicode Standard, Version
1191 5.0", ISBN 0-321-48091-0, 2007.
1193 Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
1194 This printed reference has now been updated online to
1195 reflect additional code points. For code points, the
1196 reference at the time RFCs 5890-5894 were published is to
1197 Unicode 5.2.
1199 [Unicode62]
1200 The Unicode Consortium, "The Unicode Standard, Version
1201 6.2.0", ISBN 978-1-936213-07-8, 2012.
1204 Preferred citation: The Unicode Consortium. The Unicode
1205 Standard, Version 6.2.0, (Mountain View, CA: The Unicode
1206 Consortium, 2012. ISBN 978-1-936213-07-8)
1208 [Unicode7]
1209 The Unicode Consortium, "The Unicode Standard, Version
1210 7.0.0", ISBN 978-1-936213-09-2, 2014.
1213 Preferred Citation: The Unicode Consortium. The Unicode
1214 Standard, Version 7.0.0, (Mountain View, CA: The Unicode
1215 Consortium, 2014. ISBN 978-1-936213-09-2)
1217 [Unicode70-Arabic]
1218 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1219 9.2: Arabic", Chapter 9, 2014.
1222 Subsection titled "Encoding Principles", paragraph
1223 numbered 4, starting on page 362.
1225 [Unicode70-Design]
1226 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1227 2.2: Unicode Design Principles", Chapter 2, 2014.
1230 [Unicode70-Hamza]
1231 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1232 9.2: Arabic", Chapter 9, 2014.
1235 Subsection titled "Combining Hamza Above" starting on page
1236 378.
1238 [Unicode70-Stability]
1239 "The Unicode Standard, Version 7.0.0, op. cit., Chapter
1240 2.2: Unicode Design Principles", Chapter 2, 2014.
1243 Subsection titled "Stability" starting on page 23 and
1244 containing a link to http://www.unicode.org/policies/
1245 stability_policy.html.
1247 10.2. Informative References
1249 [Dalby] Dalby, A., "Dictionary of Languages: The definitive
1250 reference to more than 400 languages", Columbia University
1251 Press, 2004.
1253 pages 206-207
1255 [Daniels] Daniels, P. and W. Bright, "The World's Writing Systems",
1256 Oxford University Press, 1986.
1258 [HomographAttack]
1259 Gabrilovich, E. and A. Gontmakher, "The Homograph Attack",
1260 Communications of the ACM 45(2):128, February 2002.
1264 [ICANN-VIP]
1265 ICANN, "The IDN Variant Issues Project: A Study of Issues
1266 Related to the Management of IDN Variant TLDs (Integrated
1267 Issues Report)", February 2012.
1271 [Omniglot-Fula]
1272 Ager, S., "Omniglot: Fula (Fulfulde, Pulaar,
1273 Pular'Fulaare)".
1276 Captured 2015-01-07
1278 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
1279 "Internationalizing Domain Names in Applications (IDNA)",
1280 RFC 3490, March 2003.
1282 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1283 Resource Identifier (URI): Generic Syntax", STD 66, RFC
1284 3986, January 2005.
1286 [RFC4033] Arends, R., Austein, R., Larson, M., Massey, D., and S.
1287 Rose, "DNS Security Introduction and Requirements", RFC
1288 4033, March 2005.
1290 [RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
1291 "Linguistic Guidelines for the Use of the Arabic Language
1292 in Internet Domains", RFC 5564, February 2010.
1294 [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
1295 Internationalized Domain Names for Applications (IDNA) -
1296 Unicode 6.0", RFC 6452, November 2011.
1298 [RFC6698] Hoffman, P. and J. Schlyter, "The DNS-Based Authentication
1299 of Named Entities (DANE) Transport Layer Security (TLS)
1300 Protocol: TLSA", RFC 6698, August 2012.
1302 [Unicode32]
1303 The Unicode Consortium, "The Unicode Standard, Version
1304 3.2.0".
1306 The Unicode Standard, Version 3.2.0 is defined by The
1307 Unicode Standard, Version 3.0 (Reading, MA, Addison-
1308 Wesley, 2000. ISBN 0-201-61633-5), as amended by the
1309 Unicode Standard Annex #27: Unicode 3.1
1310 (http://www.unicode.org/reports/tr27/) and by the Unicode
1311 Standard Annex #28: Unicode 3.2
1312 (http://www.unicode.org/reports/tr28/).
1314 Appendix A. Change Log
1316 RFC Editor: Please remove this appendix before publication.
1318 A.1. Changes from version -00 to -01
1320 o Version 01 of this document is an extensive rewrite and
1321 reorganization, reflecting discussions with UTC members and adding
1322 three more options for discussion to the original proposal to
1323 simply disallow the new code point.
1325 A.2. Changes from version -01 to -02
1327 Corrected a typographical error in which Hamza Above was incorrectly
1328 listed with the wrong code point.
1330 A.3. Changes from version -02 to -03
1332 Corrected a typographical error in the Abstract in which RFC 5892 was
1333 incorrectly shown as 5982.
1335 A.4. Changes from version -03 to -04
1337 o Explicitly identified the applicability of U+08A1 with Fula and
1338 added references that discuss that language and how it is written.
1340 o Updated several Unicode 6.2 references to point to Unicode 7.0
1341 since the latter is now available in stable form (it was not when
1342 work on this I-D started).
1344 o Extensively revised to discuss the non-Arabic cases, non-
1345 decomposing diacritics, other types of characters that don't
1346 compare equal after normalization, and the more general problem and
1347 approaches.
1349 Authors' Addresses
1351 John C Klensin
1352 1770 Massachusetts Ave, Ste 322
1353 Cambridge, MA 02140
1354 USA
1356 Phone: +1 617 245 1457
1357 Email: john-ietf@jck.com
1359 Patrik Faltstrom
1360 Netnod
1361 Franzengatan 5
1362 Stockholm 112 51
1363 Sweden
1365 Phone: +46 70 6059051
1366 Email: paf@netnod.se