< draft-klensin-idna-5892upd-unicode70-00.txt   draft-klensin-idna-5892upd-unicode70-01.txt >
Network Working Group J.C. Klensin Network Working Group J. Klensin
Internet-Draft P. Faltstrom Internet-Draft
Updates: 5982 (if approved) Netnod Updates: 5892, 5894 (if approved) P. Faltstrom
Intended status: Standards Track July 21, 2014 Intended status: Standards Track Netnod
Expires: January 20, 2015 Expires: June 10, 2015 December 7, 2014
IDNA Update for Unicode 7.0.0 IDNA Update for Unicode 7.0.0
draft-klensin-idna-5892upd-unicode70-00.txt draft-klensin-idna-5892upd-unicode70-01.txt
Abstract Abstract
The current version of the IDNA specifications anticipated that each The current version of the IDNA specifications anticipated that each
new version of Unicode would be reviewed to verify that no changes new version of Unicode would be reviewed to verify that no changes
had been introduced that required adjustments to the set of rules had been introduced that required adjustments to the set of rules
and, in particular, whether new exceptions or backward compatibility and, in particular, whether new exceptions or backward compatibility
adjustments were needed. That review was conducted for Unicode 7.0.0 adjustments were needed. That review was conducted for Unicode 7.0.0
and identified a problematic new code point. This specification and identified a potentially problematic new code point. This
updates RFC 5892 to disallow that code point and provides information specification discusses that code point and associated issues and
about the reasons why that exclusion is appropriate. It also applies updates RFC 5892 accordingly. It also applies an editorial
an editorial clarification that was the subject of an earlier clarification that was the subject of an earlier erratum. In
erratum. addition, the discussion of the specific issue updates RFC 5894.
Status of this Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 20, 2015. This Internet-Draft will expire on June 10, 2015.
Copyright Notice Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (http://trustee.ietf.org/ Provisions Relating to IETF Documents
license-info) in effect on the date of publication of this document. (http://trustee.ietf.org/license-info) in effect on the date of
Please review these documents carefully, as they describe your rights publication of this document. Please review these documents
and restrictions with respect to this document. Code Components carefully, as they describe your rights and restrictions with respect
extracted from this document must include Simplified BSD License text to this document. Code Components extracted from this document must
as described in Section 4.e of the Trust Legal Provisions and are include Simplified BSD License text as described in Section 4.e of
provided without warranty as described in the Simplified BSD License. the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Change to RFC 5892 for new character U+08A1 . . . . . . . . . 4 2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5
3. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 4 2.1. IDNA assumptions about Unicode normalization . . . . . . 5
4. Explanation . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2. New code point U+08A1, decomposition, and language
4.1. A related historical problem . . . . . . . . . . . . . . . 6 dependency . . . . . . . . . . . . . . . . . . . . . . . 6
4.2. How this is being done . . . . . . . . . . . . . . . . . . 7 2.3. Other examples of the same behavior . . . . . . . . . . . 7
4.2.1. Backward compatibility and normalization . . . . . . . 7 2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8
4.2.2. A new contextual rule . . . . . . . . . . . . . . . . 7 3. Proposed/ Alternative Changes to RFC 5892 for new character
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 8 U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8 3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9
7. Security Considerations . . . . . . . . . . . . . . . . . . . 8 3.2. Disallow the combining sequences for these characters . . 10
8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11
8.1. Normative References . . . . . . . . . . . . . . . . . . . 9 3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11
8.2. Informative References . . . . . . . . . . . . . . . . . . 10 4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 10 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12
7. Security Considerations . . . . . . . . . . . . . . . . . . . 12
8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13
8.1. Normative References . . . . . . . . . . . . . . . . . . 13
8.2. Informative References . . . . . . . . . . . . . . . . . 14
Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15
A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15
1. Introduction 1. Introduction
The current version of the IDNA specifications, known as "IDNA2008" The current version of the IDNA specifications, known as "IDNA2008"
[RFC5890], anticipated that each new version of Unicode would be [RFC5890], anticipated that each new version of Unicode would be
reviewed to verify that no changes had been introduced that required reviewed to verify that no changes had been introduced that required
adjustments to IDNA's rules and, in particular, whether new adjustments to IDNA's rules and, in particular, whether new
exceptions or backward compatibility adjustments were needed. When exceptions or backward compatibility adjustments were needed. When
that review was carefully conducted for Unicode 7.0.0 [Unicode7], that review was carefully conducted for Unicode 7.0.0 [Unicode7],
comparing it to prior versions including the text in Unicode 6.2 comparing it to prior versions including the text in Unicode 6.2
[Unicode62], it identified a problematic new code point (U+08A1, [Unicode62], it identified a problematic new code point (U+08A1,
ARABIC LETTER BEH WITH HAMZA ABOVE). Section 2 of this specification ARABIC LETTER BEH WITH HAMZA ABOVE). The specific problem is
updates the portion of the IDNA2008 specification that identifies discussed in detail in Section 2. The behavior of that code point,
rules for what characters are permitted [RFC5892] to disallow that while non-optimal for IDNA, follows that of a few code points that
code point. It also provides information about the reasons why that predate Unicode 7.x and even the IDNA 2008 specifications and Unicode
exclusion is appropriate. 6.0. Those existing code points make the question of what, if
anything, to do about this new one exceedingly problematic because
different reasonable criteria yield different decisions,
specifically:
o To disallow it as an IDNA exception case creates inconsistencies
with how those earlier code points were handled.
o To disallow it and the similar code points as well would
necessitate invalidating some potential labels that would have
been valid under IDNA2008 until this time. However, there is
reason to believe that no such labels exist.
o To permit the new code point to be treated as PVALID creates a
situation in which it is possible, within the same script, to
compose the same character symbol (glyph) in two different ways
that do not compare equal even after normalization. That
condition would then apply to it and the earlier code points with
the same behavior. That situation contradicts a fundamental
assumption of IDNA that is discussed in more detail below.
NOTE IN DRAFT:
This working draft discusses four alternatives, including, for
illustration, a radical idea that seems too drastic to be
considered now although it would have been appropriate to discuss
when the IDNA2008 specifications were being developed. The
authors suggest that the community discuss the relevant tradeoffs
and make a decision and that the document then be revised to
reflect that decision, with the other alternatives discussed as
options not chosen. Because there is no ideal choice, the
discussion of the issues in Section 2 is probably as or more
important than the particular choice of how to handle this code
point. In addition to providing information for this document,
that section should be considered as an updating addendum to RFC
5894 [RFC5894] and should be incorporated into any future revision
of that document.
As the result of this version of the document containing several
alternate proposals, some of the text is also a little bit
redundant. That will be corrected in future versions.
As anticipated when IDNA2008, and RFC 5892 in particular, were As anticipated when IDNA2008, and RFC 5892 in particular, were
written, exceptions and explicit updates are likely to be needed only written, exceptions and explicit updates are likely to be needed only
if there is disagreement between the Unicode Consortium's view about if there is disagreement between the Unicode Consortium's view about
what is best for the Standard and the IETF's view of what is best for what is best for the Standard and the IETF's view of what is best for
IDNs, the DNS, and IDNA. It was hoped that a situation would never IDNs, the DNS, and IDNA. It was hoped that a situation would never
arise in which the two perspectives would disagree, but the arise in which the two perspectives would disagree, but the
possibility was anticipated and considerable mechanism added to RFC possibility was anticipated and considerable mechanism added to RFC
5890 and 5892 as a result. It is probably important to note that a 5890 and 5892 as a result. It is probably important to note that a
disagreement in this context does not imply that anyone is "wrong", disagreement in this context does not imply that anyone is "wrong",
only that the two different groups have different needs and therefore only that the two different groups have different needs and therefore
criteria about what is acceptable. For that reason, the IETF has, in criteria about what is acceptable. For that reason, the IETF has, in
the past, allowed some characters for IDNA that active Unicode the past, allowed some characters for IDNA that active Unicode
Technical Committee members suggested be disallowed to avoid a change Technical Committee members suggested be disallowed to avoid a change
in derived tables [RFC6452]. This document describes a case where in derived tables [RFC6452]. This document describes a case where
the IETF should disallow a character that the various properties the IETF should disallow a character or characters that the various
would otherwise treat as PVALID. properties would otherwise treat as PVALID.
This document provides the "flagging for the IESG" specified by This document provides the "flagging for the IESG" specified by
Section 5.1 of RFC 5892. As specified there, the change itself Section 5.1 of RFC 5892. As specified there, the change itself
requires IETF review because it alters the rules of Section 2 of that requires IETF review because it alters the rules of Section 2 of that
document. document.
Readers of this document are expected to be familiar with Unicode Readers of this document are expected to be familiar with Unicode
terminology [Unicode62] and the IETF conventions for representing terminology [Unicode62] and the IETF conventions for representing
Unicode code points [RFC5137]. Unicode code points [RFC5137].
As a convenience to readers of RFC 5892 and to reduce the risks of As a convenience to readers of RFC 5892 and to reduce the risks of
confusion, this document also formally applies the content of an confusion, this document also formally applies the content of an
erratum to the text of the RFC (see Section 3) and so brings that RFC erratum to the text of the RFC (see Section 4) and so brings that RFC
up to date with all agreed changes. up to date with all agreed changes.
[[RFC Editor: please remove the following comment and note if they [[RFC Editor: please remove the following comment and note if they
get to you.]] get to you.]]
[[IESG: It might not be a bad idea to incorporate some version of [[IESG: It might not be a bad idea to incorporate some version of
the following into the Last Call announcement.]] the following into the Last Call announcement.]]
NOTE IN DRAFT to IETF Reviewers: The issues in this document, and NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
particularly the extended discussion below of why this change to particularly the choices among options for either adding exception
RFC 5892 is necessary and appropriate, are fairly esoteric. cases to RFC 5892 or ignoring the issue, warning people, and
Understanding them requires that one have at least some hoping the results do not include serious problems, are fairly
esoteric. Understanding them requires that one have at least some
understanding of how the Arabic Script works and the reasons the understanding of how the Arabic Script works and the reasons the
Unicode Standard gives various Arabic Script characters a fairly Unicode Standard gives various Arabic Script characters a fairly
extended discussion. It also requires understanding of a number extended discussion [Unicode62-Arabic]. It also requires
of Unicode principles, including the Normalization Stability rules understanding of a number of Unicode principles, including the
as applied to new precomposed characters and guidelines for adding Normalization Stability rules [UAX15-Versioning] as applied to new
new characters. References are provided for those who want to precomposed characters and guidelines for adding new characters.
pursue them, but potential reviewers should assume that the There is considerable discussion of the issues in Section 2 and
background needed to understand the reasons for this change is no references are provided for those who want to pursue them, but
less deep in the subject matter than would be expected of someone potential reviewers should assume that the background needed to
reviewing a proposed change in, e.g., the fundamentals of BGP, TCP understand the reasons for this change is no less deep in the
congestion control, or some cryptographic algorithm. subject matter than would be expected of someone reviewing a
proposed change in, e.g., the fundamentals of BGP, TCP congestion
control, or some cryptographic algorithm. Put more bluntly, one's
ability to read or speak languages other than English, or even one
or more languages that use the Arabic script, does not make one an
expert in these matters.
2. Change to RFC 5892 for new character U+08A1 2. Problem Description
With the publication of this document, Section 2.6 ("Exceptions (F)") 2.1. IDNA assumptions about Unicode normalization
of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
Category F so that the rule itself reads:
F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, IDNA makes several assumptions about Unicode, Unicode "characters",
0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668, and the effects of normalization. Those assumptions were based on
0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6, careful reading of the Unicode Standard at the time [Unicode5],
06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B, guided by advice and commitments by members of the Unicode Technical
3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035, Committee. Those assumptions, and the associated requirements, are
303B, 30FB} necessitated by three properties of DNS labels that do not apply to
blocks of running text:
and then add to the subtable designated 1. There is no language context for a label. While particular DNS
"DISALLOWED -- Would otherwise have been PVALID" zones may impose restrictions, including language or script
after the line that begins "07FA", the additional line: restrictions, on what labels can be registered, neither the DNS
nor IDNA impose either type of restriction or give the user of a
label any indication about the registration or other restrictions
that may have been imposed.
08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE 2. Labels are often mnemonics rather than words in any language.
They may be abbreviations or acronyms or contain embedded digits
and have other characteristics that are not typical of words.
This has the effect of making the cited code point DISALLOWED 3. Labels are, in practice, usually short. Even when they are the
independent of application of the rest of the IDNA rule set to the maximum length allowed by the DNS and IDNA, they are typically
current version of Unicode. Those wishing to create domain name too short to provide significant context. Statements that
labels containing Beh with Hamza Above may continue to use the suggest that languages can almost always be determined from
sequence relatively short paragraphs or equivalent bodies of text do not
apply to DNS labels because of their typical short length and
because, as noted above, they are not required to be formed
according to language-based rules.
U+0628, ARABIC LETTER BEH At the same time, because the DNS is an exact-match system, there
followed by must be no ambiguity about whether two labels are equal. Although
there have been extensive discussions about "confusingly similar"
characters, labels, and strings, such tests between scripts are
always somewhat subjective: they are affected by choices of type
styles and by what the user expects to see. In spite of the fact
that the glyphs that represent many characters in different scripts
are identical in appearance (e.g., basic Latin "a" (U+0061) and the
identical-appearing Cyrillic character (U+0430)), the most important
test is that, if two glyphs are the same within a given script, they
must represent the same character no matter how they are formed.
U+0654, ARABIC HAMZA ABOVE Unicode normalization, as explained in [UAX15], is expected to
resolve those "same script, same glyph, different formation methods"
issues. Within the Latin script, the code point sequence for lower
case "o" (U+006F) and combining diaeresis (U+0308) will, when
normalized using the "NFC" method required by IDNA, produce the
precombined small letter o with diaeresis (U+00F6) and hence the two
ways of forming the character will compare equal (and the combining
sequence is effectively prohibited from U-labels).
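As an illustration only (it is not part of the IDNA specifications), the Latin-script case described above can be checked with Python's standard unicodedata module. The helper name same_after_nfc is hypothetical, and the result depends on the Unicode database bundled with the local Python build.

```python
import unicodedata

def same_after_nfc(a: str, b: str) -> bool:
    """Return True if two strings are identical once both are in NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

combining = "o\u0308"   # U+006F followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00f6"  # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

# The raw code point sequences differ, but their NFC forms do not.
assert combining != precomposed
assert same_after_nfc(combining, precomposed)

# NFC turns the combining sequence into the single precomposed code point,
# which is why the combining sequence cannot appear in a U-label.
assert unicodedata.normalize("NFC", combining) == precomposed
```

This is the behavior IDNA relies on: whichever way the character was entered, the normalized forms are the same string and therefore match exactly.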
which was valid for IDNA purposes in Unicode 5.0 and earlier and NFC was preferred over other normalization methods for IDNA because
which continues to be valid. it is more compact, more likely to be produced on keyboards on which
the relevant characters actually appeared, and because it does not
lose substantive information (e.g., some types of compatibility
equivalence involve judgment calls as to whether two characters are
actually the same -- they may be "the same" in some contexts but not
others -- while canonical equivalence is about different ways to
produce the glyph for the same abstract character).
3. Editorial clarification to RFC 5892 IDNA also assumed that the extensive Unicode stability rules would be
applied and work as specified when new code points were added. Those
rules, as described in The Unicode Standard and the normative annexes
identified below, provide that:
Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a 1. New code points representing precombined characters that can be
clarification to Appendix A and Section A.1 of RFC 5892. This formed from combining sequences will not be added to Unicode
section of this document updates the RFC to apply that clarification. unless neither the relevant base character nor required combining
character are part of the Standard within the relevant script
[UAX15-Versioning].
1. In Appendix A, add a new paragraph after the paragraph that 2. If circumstances require that principle be violated,
begins "The code point...". The new paragraph should read: normalization stability requires that the newly-added character
decompose (even under NFC) to the previously-available combining
sequence [UAX15-Exclusion].
"For the rule to be evaluated to True for the label, it MUST be There is no explicit provision in the Standard's discussion of
evaluated separately for every occurrence of the Code point in the conditions for adding new code points, nor of normalization
label; each of those evaluations must result in True." stability, for an exception based on different languages using the
same script.
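The second stability rule can be observed in practice. The sketch below is illustrative only: U+2ADC FORKING is an example chosen for this note (it does not appear in the drafts) of a precomposed character that is listed in the Unicode composition exclusions, so it decomposes even under NFC. The helper name nfc is hypothetical.

```python
import unicodedata

def nfc(s: str) -> str:
    """Shorthand for NFC normalization using the local Unicode database."""
    return unicodedata.normalize("NFC", s)

# Illustrative example, not taken from the drafts: U+2ADC FORKING has the
# canonical decomposition U+2ADD U+0338 and is excluded from composition.
late_precomposed = "\u2adc"
older_sequence = "\u2add\u0338"

nfc_of_precomposed = nfc(late_precomposed)
nfc_of_sequence = nfc(older_sequence)

# The precomposed code point decomposes even under NFC ...
assert nfc_of_precomposed == older_sequence
# ... and the older sequence is left unchanged, so both forms compare equal.
assert nfc_of_sequence == older_sequence
```

In a case handled this way, the normalization of existing strings does not change and the two spellings still match after NFC.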
2. In Appendix A, Section A.1, replace the "Rule Set" by 2.2. New code point U+08A1, decomposition, and language dependency
Rule Set: Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
False; WITH HAMZA ABOVE. As can be deduced from the name, it is visually
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; identical to the glyph that can be formed from a combining sequence
If cp .eq. \u200C And consisting of the code point for ARABIC LETTER BEH (U+0628) and the
RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp code point for Combining Hamza Above (U+0654). The two rules
(Joining_Type:T)*(Joining_Type:{R,D})) Then True; summarized above suggest that either the new code point should not be
allocated at all or that it should have a decomposition to
\u'0628'\u'0654'.
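The practical consequence can be sketched as follows. This is a minimal illustration, assuming a Python build whose unicodedata module is based on Unicode 7.0.0 or later; the helper name nfc_equal is hypothetical.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization, as IDNA matching assumes."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

new_code_point = "\u08a1"            # ARABIC LETTER BEH WITH HAMZA ABOVE
combining_sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# U+08A1 carries no decomposition mapping, so the database returns "".
has_mapping = unicodedata.decomposition(new_code_point) != ""

# Because nothing maps one form onto the other, the two spellings of the
# same visual form remain different strings after normalization.
forms_match = nfc_equal(new_code_point, combining_sequence)

print("decomposition mapping present:", has_mapping)
print("equal after NFC:", forms_match)
```

Two labels that differ only in this choice of spelling would therefore be distinct DNS names despite looking the same.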
4. Explanation Had the issues outlined in this document been better understood at
the time, it probably would have been wise for RFC 5892 to disallow
either the precomposed character or the combining sequence of each
pair in those cases in which Unicode normalization rules do not cause
the right thing to happen, i.e., the combining sequence and
precomposed character to be treated as equivalent. Failure to do so
at the time places an extra burden on registries to be sure that
conflicts (and the potential for confusion and attacks) do not exist.
Oddly, had the exclusion been made part of the specification at that
time, the preference for precombined forms noted above would probably
have dictated excluding the combining sequence, something not
otherwise done in IDNA2008 because the NFC requirement serves the
same purpose. Today, the only thing that can be excluded without the
potential disruption of disallowing a previously-PVALID combining
sequence is the newly-added code point, so whatever is
done, or might have been contemplated with hindsight, will be
somewhat inconsistent.
[[NOTE IN DRAFT: Given the nature of this document, we believe this 2.3. Other examples of the same behavior
material belongs here. It could, however, be moved to an appendix if
anyone felt strongly about that.]]
This section summarizes some of the discussions and reasoning that One of the things that complicates the issue with the new U+08A1 code
led to the conclusion and change in Section 2. It should not be point is that there are several other Arabic-script code points that
considered as either normative or authoritative. behave in the same way for similar language-specific reasons.
In particular, at least three other grapheme clusters that have been
present for many versions of Unicode can be seen as involving issues
similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
preferred over combining sequences using HAMZA ABOVE (U+0654)
[Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
(U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH
HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the
precomposed character and combining sequences compare equal when both
are normalized, as this specification prefers.
There are other variations in which a precomposed character involving
HAMZA ABOVE has a decomposition to a combining sequence that can form
it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
compatibility (???) decomposition into the combining sequence
\u'06C7'\u'0674'.
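Readers who want to check these cases against their own Unicode data can list the decomposition mappings directly. The sketch below is illustrative; the function name decomposition_report and the output layout are hypothetical, and the values printed are whatever the local Python build's Unicode database contains.

```python
import unicodedata

# Code points discussed in this subsection.
CODE_POINTS = [0x0681, 0x076C, 0x0623, 0x0626, 0x0677]

def decomposition_report(cps):
    """Return (code point, name, mapping) rows from the local Unicode data."""
    rows = []
    for cp in cps:
        ch = chr(cp)
        name = unicodedata.name(ch, "<unassigned in this Unicode version>")
        # decomposition() returns "" when no mapping exists; a leading
        # "<compat>" tag marks a compatibility (not canonical) mapping.
        mapping = unicodedata.decomposition(ch) or "(none)"
        rows.append((f"U+{cp:04X}", name, mapping))
    return rows

for row in decomposition_report(CODE_POINTS):
    print(*row, sep="  ")
```

Only canonical mappings affect NFC, so a row showing "(none)" or a "<compat>" tag identifies a precomposed character that NFC will not make equal to any combining sequence.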
2.4. Hamza and Combining Sequences
As the Unicode Standard points out at some length [Unicode62-Arabic], As the Unicode Standard points out at some length [Unicode62-Arabic],
Hamza is a problematic abstract character and the "Hamza Above" Hamza is a problematic abstract character and the "Hamza Above"
construction even more so [Unicode62-Hamza]. Those sections explain construction even more so [Unicode62-Hamza]. Those sections explain
a distinction made by Unicode between the use of a Hamza mark to a distinction made by Unicode between the use of a Hamza mark to
denote a glottal stop and one used as a diacritic mark to denote a denote a glottal stop and one used as a diacritic mark to denote a
separate letter. In the first case, the combining sequence is used. separate letter. In the first case, the combining sequence is used.
In the second, a precombined character is assigned. In the second, a precombined character is assigned.
Unlike Unicode generally and because of concerns about identifier Unlike Unicode generally and because of concerns about identifier
spoofing and attacks based on similarities, character distinctions in spoofing and attacks based on similarities, character distinctions in
IDNA are based much more strictly on the appearance of characters; IDNA are based much more strictly on the appearance of characters;
pronunciation distinctions are not considered. So, for IDNA, BEH language and pronunciation distinctions within a script are not
WITH HAMZA ABOVE is not-quite-tautologically the same as BEH WITH considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-
HAMZA ABOVE, even if one of them is written as U+08A1 (new to Unicode tautologically the same as BEH WITH HAMZA ABOVE, even if one of them
7.0.0) and the other as the sequence \u'0628'\u'0654' (feasible with is written as U+08A1 (new to Unicode 7.0.0) and the other as the
Unicode 7.0.0 but also available in versions of Unicode going back at sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also
least to the original publication of RFC 5892). Because the two available in versions of Unicode going back at least to the version
are, for IDNA purposes, the same, IDNA expects that normalization [Unicode32] used in the original version of IDNA [RFC3490]). Because
(specifically the requirement that all U-labels be in NFC form) will the precomposed form and combining sequence are, for IDNA purposes,
cause them to compare equal. the same, IDNA expects that normalization (specifically the
requirement that all U-labels be in NFC form) will cause them to
compare equal.
If Unicode also considered them the same, then the principle would If Unicode also considered them the same, then the principle would
apply that new precomposed ("composition") forms are not added unless apply that new precomposed ("composition") forms are not added unless
one of the code points that could be used to construct it did not one of the code points that could be used to construct it did not
exist in an earlier version (and even then is exist in an earlier version (and even then is
discouraged) [UAX15-Versioning]. When exceptions are made, they are discouraged) [UAX15-Versioning]. When exceptions are made, they are
expected to conform to the rules and classes in the "Composition expected to conform to the rules and classes in the "Composition
Exclusion Table", with class 2 being relevant to this case Exclusion Table", with class 2 being relevant to this case
[UAX15-Exclusion]. That rule essentially requires that the [UAX15-Exclusion]. That rule essentially requires that the
normalization for the old combining sequence to itself be retained normalization for the old combining sequence to itself be retained
(for stability) but that the newly-added character be treated as (for stability) but that the newly-added character be treated as
canonically decomposable and decompose back to the older sequence canonically decomposable and decompose back to the older sequence
even under NFC. That was not done for this particular case, even under NFC. That was not done for this particular case,
presumably because of the distinction about prounciation modifiers presumably because of the distinction about pronunciation modifiers
versus separate letters noted above. Because, for IDNA and the DNS, versus separate letters noted above. Because, for IDNA and the DNS,
there is a possibility that the composing sequence \u'0628'\u'0654' there is a possibility that the composing sequence \u'0628'\u'0654'
already appears in labels, the only choice other than allowing an already appears in labels, the only choice other than allowing an
otherwise-identical, and identically-appearing, label with U+08A1 otherwise-identical, and identically-appearing, label with U+08A1
substituted to identify a different DNS entry is to DISALLOW the new substituted to identify a different DNS entry is to DISALLOW the new
character. character.
4.1. A related historical problem 3. Proposed/ Alternative Changes to RFC 5892 for new character U+08A1
At least three other grapheme clusters have been present for many NOTE IN DRAFT: See the comments in the Introduction, Section 1 and
versions of Unicode and can be seen as involving issues similar to the first paragraph of each Subsection below for the status of the
those for the newly-added ARABIC LETTER BEH WITH HAMZA ABOVE. ARABIC Subsections that follow. Each one, in combination with the material
LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER REH WITH HAMZA in Section 2 above, also provides information about the reasons why
ABOVE (U+076C) do not have decomposition forms and are preferred over that particular strategy is appropriate.
combining sequences using HAMZA ABOVE (U+0654) [Unicode62-Hamza]. By
contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE (U+0623) decomposes
into \u'0627'\u'0654' and ARABIC LETTER YEH WITH HAMZA ABOVE (U+0626)
decomposes into \u'064A'\u'0654' so the precomposed character and
combining sequences compare equal when both are normalized, as this
specification prefers.
There are other variations on this theme. For example, ARABIC LETTER 3.1. Disallow This New Code Point
U WITH HAMZA ABOVE (U+0677) has a compatibility decomposition into
the combining sequence \u'06C7'\u'0674'.
Had the issues outlined in this document been better understood at If chosen by the community, this subsection would update the portion
the time, it probably would have been wise for RFC 5892 to disallow of the IDNA2008 specification that identifies rules for what
either the precomposed character or the combining sequence of each characters are permitted [RFC5892] to disallow that code point.
pair unless Unicode normalization rules cause the right thing to
happen. Failure to do so at the time places an extra burden on
registries to be sure that conflicts (and the potential for confusion
and attacks) do not exist. Oddly, had the exclusion been made part
of the specification at that time, the preference noted above would
probably have dictated excluding the combining sequence, something
not otherwise done in IDNA2008. Today, the only thing that can be
excluded without the potential disruption of disallowing a
previously-PVALID combining sequence is the newly-added code point so
whatever is done, or might have been contemplated with hindsight, it
would be somewhat inconsistent.
4.2. How this is being done With the publication of this document, Section 2.6 ("Exceptions (F)")
of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
Category F so that the rule itself reads:
Questions have arisen has to why this specification makes the change F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
to RFC 5892 by DISALLOWing U+08A1 as a simple exception (IDNA 0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
Category F, RFC 5892 Section 2.7) rather than either a backward- 0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
compatibility case (IDNA Category G, RFC 5982 Section 2.8) or 06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 08A1, 0F0B,
modifying IDNA Category F to make Hamza (or Hamza Above, or combining 3007, 302E, 302F, 3031, 3032, 3033, 3034, 3035,
Hamza generally) into CONTEXTO cases and specifying appropriate 303B, 30FB}
limitations in a new entry in the IANA IDNA Context Registry (as
specified in RFC 5892 Section 5.2). The subsections below explain
why neither of those alternatives was chosen despite some discussion
of each.
4.2.1. Backward compatibility and normalization and then add to the subtable designated
"DISALLOWED -- Would otherwise have been PVALID"
after the line that begins "07FA", the additional line:
The "BackwardCompatible" category (IDNA Category G, RFC 5892 Section 08A1; DISALLOWED # ARABIC LETTER BEH WITH HAMZA ABOVE
5.3) is described as applying only when "property values in versions
of Unicode after 5.2 have changed in such a way that the derived
property value would no longer be PVALID or DISALLOWED". Because
U+08A1 is a newly-added code point in Unicode 7.0.0 and no property
values of code points in prior versions have changed, that category G
does not apply. If that section of RFC 5892 is replaced in the
future, perhaps consideration should be given to adding Normalization
Stability and other issues to that description but, at present, it is
not relevant.
4.2.2. A new contextual rule This has the effect of making the cited code point DISALLOWED
independent of application of the rest of the IDNA rule set to the
current version of Unicode. Those wishing to create domain name
labels containing Beh with Hamza Above may continue to use the
sequence
U+0628, ARABIC LETTER BEH
followed by
U+0654, ARABIC HAMZA ABOVE
which was valid for IDNA purposes in Unicode 5.0 and earlier and
which continues to be valid.
In principle, much the same thing could be accomplished by using the
IDNA "BackwardCompatible" category (IDNA Category G, RFC 5892
Section 5.3). However, that category is described as applying only
when "property values in versions of Unicode after 5.2 have changed
in such a way that the derived property value would no longer be
PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in
Unicode 7.0.0 and no property values of code points in prior versions
have changed, category G does not apply. If that section of RFC 5892
were to be replaced in the future, perhaps consideration should be
given to adding Normalization Stability and other issues to that
description but, at present, it is not relevant.
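The effect of the Category F change in Section 3.1 can be sketched as an override table consulted before any rule-based derivation. This is an illustration under stated assumptions, not the RFC 5892 algorithm itself: the table below is deliberately partial (three entries of the DISALLOWED subtable), and both derived_property and the rule_based callable are hypothetical names introduced here.

```python
# Partial, illustrative copy of the "DISALLOWED -- Would otherwise have
# been PVALID" subtable; the real Category F list has many more entries.
EXCEPTIONS = {
    0x0640: "DISALLOWED",  # ARABIC TATWEEL
    0x07FA: "DISALLOWED",  # NKO LAJANYALAN
    0x08A1: "DISALLOWED",  # ARABIC LETTER BEH WITH HAMZA ABOVE (proposed addition)
}


def derived_property(cp, rule_based):
    """Return the exception value for cp if one exists, else rule_based(cp).

    rule_based is a caller-supplied stand-in (an assumption of this sketch)
    for the rest of the IDNA derivation rules.
    """
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    return rule_based(cp)


def disallowed_code_points(label, rule_based):
    """List the code points in label whose derived property is DISALLOWED."""
    return [ord(ch) for ch in label
            if derived_property(ord(ch), rule_based) == "DISALLOWED"]
```

With a stand-in such as rule_based = lambda cp: "PVALID", a label containing U+08A1 is rejected, while the same label spelled with U+0628 followed by U+0654 is unaffected by the exception.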
3.2. Disallow the combining sequences for these characters
If chosen by the community, this subsection would update the portion
of the IDNA2008 specification that identifies contextual rules
[RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction
with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that
the choice of this option is consistent with the general preference
for precomposed characters discussed above but would ban some labels
that are valid today and that might, in principle, be in use.
The required prohibition could be imposed by creating a new
contextual rule in RFC 5892 to constrain combining sequences
containing Hamza Above.
As the Unicode Standard points out at some length [Unicode62-Arabic], As the Unicode Standard points out at some length [Unicode62-Arabic],
Hamza is a problematic abstract character and the "Hamza Above" Hamza is a problematic abstract character and the "Hamza Above"
construction even more so. IDNA has historically associated construction even more so. IDNA has historically associated
characters whose use is reasonable in some contexts but not others characters whose use is reasonable in some contexts but not others
with the special derived property "CONTEXTO" and then specified with the special derived property "CONTEXTO" and then specified
specific, context-dependent, rules about where they may be used. specific, context-dependent, rules about where they may be used.
Because Hamza Above is problematic (and spawns edge cases, as Because Hamza Above is problematic (and spawns edge cases, as
discussed in the Unicode Standard section cited above), it was discussed in the Unicode Standard section cited above), it was
suggested that a contextual rule might be appropriate. There are at suggested that a contextual rule might be appropriate. There are at
least two reasons why a contextual rule would not be suitable for the least two reasons why a contextual rule would not be suitable for the
present situation. present situation.
1. As discussed above, the present situation is a normalization 1. As discussed above, the present situation is a normalization
stability and predictability problem, not a contextual one. Had stability and predictability problem, not a contextual one. Had
the same issues arisen with a newly-added precomposed character the same issues arisen with a newly-added precomposed character
that could previously be constructed from non-problematic base that could previously be constructed from non-problematic base
and combining characters, it would be even more clearly a and combining characters, it would be even more clearly a
normalization issue and, following the principles discussed there normalization issue and, following the principles discussed there
and particularly in UAX 15 [UAX15-Exclusion], might not have been and particularly in UAX 15 [UAX15-Exclusion], might not have been
skipping to change at page 8, line 26 skipping to change at page 11, line 12
characters within that script. Neither of these cases applies to characters within that script. Neither of these cases applies to
the newly-added character even if one could imagine rules for the the newly-added character even if one could imagine rules for the
use of Hamza Above (U+0654) that would reflect the considerations use of Hamza Above (U+0654) that would reflect the considerations
of Chapter 8 of Unicode 6.2. Even had the latter been desired, of Chapter 8 of Unicode 6.2. Even had the latter been desired,
it would be somewhat late now -- Hamza Above has been present as it would be somewhat late now -- Hamza Above has been present as
a combining character (U+0654) in many versions of Unicode. a combining character (U+0654) in many versions of Unicode.
While that section of the Unicode Standard describes the issues, While that section of the Unicode Standard describes the issues,
it does not provide actionable guidance about what to do about it it does not provide actionable guidance about what to do about it
for cases going forward or when visual identity is important. for cases going forward or when visual identity is important.
3.3. Do Nothing Other Than Warn
The recommendation from UTC is to simply warn registries, at all
levels of the tree, to be careful with this set of characters, making
language distinctions within zones. Because the DNS cannot make or
enforce language distinctions, this suggestion is problematic but it
would avoid having the IETF either invalidating label strings that
are potentially now in use or creating inconsistencies among the
characters that combine with Hamza Above but that also have
precomposed forms that do not have decompositions. The potential
would still exist for registries to respect the warning and deprecate
such labels if they existed.
3.4. Normalization Form IETF (or DNS)
The most radical possibility would be to decide that none of the
Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate
for use with the DNS because, contrary to their apparent
descriptions, normalization tables are actually determined using
language information. However, use of language information is
unacceptable for IDNA for reasons described elsewhere in this
document. The remedy would be to define an IETF-specific (or DNS-
specific) normalization form, building on NFC but adhering strictly
to the rule that normalization causes two different forms of the same
character (glyph image) within the same script to be treated as
equal. In practice such a form would be implemented for IDNA
purposes as an additional rule within RFC 5892 (and its successors)
that constituted an exception list for the NFC tables. For this set
of characters, the special IETF normalization form would be
equivalent to the exclusion discussed in Section 3.2 above.
4. Editorial clarification to RFC 5892
Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
clarification to Appendix A and Section A.1 of RFC 5892. This
section of this document updates the RFC to apply that clarification.
1. In Appendix A, add a new paragraph after the paragraph that
begins "The code point...". The new paragraph should read:
"For the rule to be evaluated to True for the label, it MUST be
evaluated separately for every occurrence of the Code point in
the label; each of those evaluations must result in True."
2. In Appendix A, Section A.1, replace the "Rule Set" by
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
If cp .eq. \u200C And
RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
(Joining_Type:T)*(Joining_Type:{R,D})) Then True;
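The clarified Appendix A requirement (evaluate the rule separately for every occurrence, and require True each time) can be sketched for the A.1 rule set as follows. This is a hedged illustration: Python's standard library does not expose Joining_Type, so the lookup is passed in as a parameter, and toy_joining_type is a tiny hand-written table used only for demonstration. All function names are assumptions of this sketch.

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # numeric Canonical_Combining_Class value named Virama


def toy_joining_type(ch):
    """Demonstration-only Joining_Type lookup covering three code points."""
    table = {"\u0628": "D", "\u0627": "R", "\u0654": "T"}
    return table.get(ch, "U")


def zwnj_ok_at(label, i, joining_type):
    """Evaluate the A.1 rule set for the ZWNJ at index i of label."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # Scan left over Joining_Type T characters, then require L or D.
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # Scan right over Joining_Type T characters, then require R or D.
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def zwnj_rule_holds(label, joining_type):
    """True only if every ZWNJ occurrence in label passes on its own.

    A label with no ZWNJ returns True vacuously; the rule is only
    meaningful when the code point is present.
    """
    return all(zwnj_ok_at(label, i, joining_type)
               for i, ch in enumerate(label) if ch == ZWNJ)
```

For example, zwnj_rule_holds("\u0628\u200C\u0628\u200C", toy_joining_type) fails because the second occurrence has nothing after it, even though the first occurrence passes.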
5. Acknowledgements 5. Acknowledgements
The Unicode 7.0.0 changes were extensively discussed within the IAB's The Unicode 7.0.0 changes were extensively discussed within the IAB's
Internationalization Program. The authors are grateful for the Internationalization Program. The authors are grateful for the
discussions and feedback there, especially from Andrew Sullivan and discussions and feedback there, especially from Andrew Sullivan and
David Thaler. Additional information was requested and received from David Thaler. Additional information was requested and received from
Mark Davis and Ken Whistler and while they probably do not agree with Mark Davis and Ken Whistler and while they probably do not agree with
the necessity of excluding this code point as their responsibility is the necessity of excluding this code point or taking even more
to look at the Unicode Consortium requirements for stability, the drastic action as their responsibility is to look at the Unicode
decision would not have been possible without their input. Several Consortium requirements for stability, the decision would not have
experts and reviewers who prefer to remain anonymous also provided been possible without their input. Several experts and reviewers who
helpful input and comments on preliminary versions of this document. prefer to remain anonymous also provided helpful input and comments
on preliminary versions of this document.
6. IANA Considerations 6. IANA Considerations
When the IANA registry and tables are updated to reflect Unicode When the IANA registry and tables are updated to reflect Unicode
7.0.0, code point U+08A1 should be identified as DISALLOWED, 7.0.0, changes should be made according to the decisions the IETF
consistent with the change made in Section 2. makes about Section 3.
7. Security Considerations 7. Security Considerations
[[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of
this document relative to what appeared in -00. It will need to be
rewritten once decisions are made about what path to follow. In
particular, if "just warn" is chosen, it will need to contain very
strong warnings.]]
This specification excludes a code point for which the Unicode- This specification excludes a code point for which the Unicode-
specified normalization behavior could result in two ways to form a specified normalization behavior could result in two ways to form a
visually-identical character within the same script not comparing visually-identical character within the same script not comparing
equal. That behavior could create a dream case for someone equal. That behavior could create a dream case for someone intending
intending to confuse the user by use of a domain name that looked to confuse the user by use of a domain name that looked identical to
identical to another one, was entirely in the same script, but was another one, was entirely in the same script, but was still
still considered different (see, for example, the discussion of false considered different (see, for example, the discussion of false
negatives in identifier comparison in Section 2.1 of RFC 6943 negatives in identifier comparison in Section 2.1 of RFC 6943
[RFC6943]). This exclusion therefore should improve Internet [RFC6943]). This exclusion therefore should improve Internet
security. security.
8. References 8. References
8.1. Normative References 8.1. Normative References
[RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP
137, RFC 5137, February 2008. 137, RFC 5137, February 2008.
[RFC5890] Klensin, J., "Internationalized Domain Names for [RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework", Applications (IDNA): Definitions and Document Framework",
RFC 5890, August 2010. RFC 5890, August 2010.
[RFC5892] Faltstrom, P., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)",
RFC 5892, August 2010.
[RFC5892Erratum] [RFC5892Erratum]
"RFC5892, "The Unicode Code Points and Internationalized "RFC5892, "The Unicode Code Points and Internationalized
Domain Names for Applications (IDNA)", August 2010, Errata Domain Names for Applications (IDNA)", August 2010, Errata
ID: 3312", Errata ID 3312, August 2012, <http://www.rfc- ID: 3312", Errata ID 3312, August 2012,
editor.org/errata_search.php?rfc=5892>. <http://www.rfc-editor.org/errata_search.php?rfc=5892>.
[RFC5892] Faltstrom, P., "The Unicode Code Points and [RFC5894] Klensin, J., "Internationalized Domain Names for
Internationalized Domain Names for Applications (IDNA)", Applications (IDNA): Background, Explanation, and
RFC 5892, August 2010. Rationale", RFC 5894, August 2010.
[RFC6943] Thaler, D., "Issues in Identifier Comparison for Security [RFC6943] Thaler, D., "Issues in Identifier Comparison for Security
Purposes", RFC 6943, May 2013. Purposes", RFC 6943, May 2013.
[UAX15] Davis, M., Ed., "Unicode Standard Annex #15: Unicode
Normalization Forms", June 2014,
<http://www.unicode.org/reports/tr15/>.
[UAX15-Exclusion] [UAX15-Exclusion]
Davis, M., Ed., "Unicode Standard Annex #15: Unicode "Unicode Standard Annex #15: ob. cit., Section 5",
Normalization Forms, Section 5", June 2014, <http:// <http://www.unicode.org/reports/
www.unicode.org/reports/tr15/ tr15/#Primary_Exclusion_List_Table>.
#Primary_Exclusion_List_Table>.
[UAX15-Versioning] [UAX15-Versioning]
Davis, M., Ed., "Unicode Standard Annex #15: Unicode "Unicode Standard Annex #15, ob. cit., Section 3",
Normalization Forms, Section 3", June 2014, <http:// <http://www.unicode.org/reports/tr15/#Versioning>.
www.unicode.org/reports/tr15/#Versioning>.
[Unicode5]
The Unicode Consortium, "The Unicode Standard, Version
5.0", ISBN 0-321-48091-0, 2007.
Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
This printed reference has now been updated online to
reflect additional code points. For code points, the
reference at the time RFC 5890-5894 were published is to
Unicode 5.2.
[Unicode62]
The Unicode Consortium, "The Unicode Standard, Version
6.2.0", ISBN 978-1-936213-07-8, 2012,
<http://www.unicode.org/versions/Unicode6.2.0/>.
Preferred citation: The Unicode Consortium. The Unicode
Standard, Version 6.2.0, (Mountain View, CA: The Unicode
Consortium, 2012. ISBN 978-1-936213-07-8)
[Unicode62-Arabic] [Unicode62-Arabic]
"The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8", "The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8",
Chapter 8, 2012, <http://www.unicode.org/versions/ Chapter 8, 2012,
Unicode6.2.0/ch08.pdf>. <http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>.
Subsection titled "Encoding Principles", paragraph Subsection titled "Encoding Principles", paragraph
numbered 4, starting on page 251. numbered 4, starting on page 251.
[Unicode62-Hamza] [Unicode62-Hamza]
"The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8", "The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8",
Chapter 8, 2012, <http://www.unicode.org/versions/ Chapter 8, 2012,
Unicode6.2.0/ch08.pdf>. <http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>.
Subsection titled "Combining Hamza Above" starting on page Subsection titled "Combining Hamza Above" starting on page
263. 263.
[Unicode62]
The Unicode Consortium, "The Unicode Standard, Version
6.2.0", ISBN 978-1-936213-07-8, 2012, <http://
www.unicode.org/versions/Unicode6.2.0/>.
Preferred citation: The Unicode Consortium. The Unicode
Standard, Version 6.2.0, (Mountain View, CA: The Unicode
Consortium, 2012. ISBN 978-1-936213-07-8)
[Unicode7] [Unicode7]
The Unicode Consortium, "The Unicode Standard, Version The Unicode Consortium, "The Unicode Standard, Version
7.0.0", ISBN 978-1-936213-09-2, 2014, <http:// 7.0.0", ISBN 978-1-936213-09-2, 2014,
www.unicode.org/versions/Unicode7.0.0/>. <http://www.unicode.org/versions/Unicode7.0.0/>.
Preferred Citation: The Unicode Consortium. The Unicode Preferred Citation: The Unicode Consortium. The Unicode
Standard, Version 7.0.0, (Mountain View, CA: The Unicode Standard, Version 7.0.0, (Mountain View, CA: The Unicode
Consortium, 2014. ISBN 978-1-936213-09-2) Consortium, 2014. ISBN 978-1-936213-09-2)
8.2. Informative References 8.2. Informative References
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003.
[RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA) - Internationalized Domain Names for Applications (IDNA) -
Unicode 6.0", RFC 6452, November 2011. Unicode 6.0", RFC 6452, November 2011.
[Unicode32]
The Unicode Consortium, "The Unicode Standard, Version
3.2.0", .
The Unicode Standard, Version 3.2.0 is defined by The
Unicode Standard, Version 3.0 (Reading, MA, Addison-
Wesley, 2000. ISBN 0-201-61633-5), as amended by the
Unicode Standard Annex #27: Unicode 3.1
(http://www.unicode.org/reports/tr27/) and by the Unicode
Standard Annex #28: Unicode 3.2
(http://www.unicode.org/reports/tr28/).
Appendix A. Change Log
RFC Editor: Please remove this appendix before publication.
A.1. Changes from version -00 to -01
o Version 01 of this document is an extensive rewrite and
reorganization, reflecting discussions with UTC members and adding
three more options for discussion to the original proposal to
simply disallow the new code point.
Authors' Addresses Authors' Addresses
John C Klensin John C Klensin
1770 Massachusetts Ave, Ste 322 1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140 Cambridge, MA 02140
USA USA
Phone: +1 617 245 1457 Phone: +1 617 245 1457
Email: john-ietf@jck.com Email: john-ietf@jck.com
Patrik Faltstrom Patrik Faltstrom
Netnod Netnod
Franzengatan 5 Franzengatan 5
Stockholm, 112 51 Stockholm 112 51
Sweden Sweden
Phone: +46 70 6059051 Phone: +46 70 6059051
Email: paf@netnod.se Email: paf@netnod.se
 End of changes. 61 change blocks. 
205 lines changed or deleted 448 lines changed or added
