< draft-klensin-idna-5892upd-unicode70-03.txt   draft-klensin-idna-5892upd-unicode70-04.txt >
Network Working Group J. Klensin Network Working Group J. Klensin
Internet-Draft Internet-Draft
Updates: 5892, 5894 (if approved) P. Faltstrom Updates: 5892, 5894 (if approved) P. Faltstrom
Intended status: Standards Track Netnod Intended status: Standards Track Netnod
Expires: July 10, 2015 January 6, 2015 Expires: September 11, 2015 March 10, 2015
IDNA Update for Unicode 7.0.0 IDNA Update for Unicode 7.0.0
draft-klensin-idna-5892upd-unicode70-03.txt draft-klensin-idna-5892upd-unicode70-04.txt
Abstract Abstract
The current version of the IDNA specifications anticipated that each The current version of the IDNA specifications anticipated that each
new version of Unicode would be reviewed to verify that no changes new version of Unicode would be reviewed to verify that no changes
had been introduced that required adjustments to the set of rules had been introduced that required adjustments to the set of rules
and, in particular, whether new exceptions or backward compatibility and, in particular, whether new exceptions or backward compatibility
adjustments were needed. That review was conducted for Unicode 7.0.0 adjustments were needed. The review for Unicode 7.0.0 first
and identified a potentially problematic new code point. This identified a potentially problematic new code point and then a much
specification discusses that code point and associated issues and more general and difficult issue with Unicode normalization. This
updates RFC 5892 accordingly. It also applies an editorial specification discusses those issues and proposes updates to IDNA
clarification that was the subject of an earlier erratum. In and, potentially, the way the IETF handles comparison of identifiers
addition, the discussion of the specific issue updates RFC 5894. more generally, especially when there is no associated language or
language identification. It also applies an editorial clarification
to RFC 5892 that was the subject of an earlier erratum and updates
RFC 5894 to point to the issues involved.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on July 10, 2015. This Internet-Draft will expire on September 11, 2015.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5 2. Document Aspirations . . . . . . . . . . . . . . . . . . . . 6
2.1. IDNA assumptions about Unicode normalization . . . . . . 5 3. Problem Description . . . . . . . . . . . . . . . . . . . . . 7
2.2. New code point U+08A1, decomposition, and language 3.1. IDNA assumptions about Unicode normalization . . . . . . 7
dependency . . . . . . . . . . . . . . . . . . . . . . . 6 3.2. The discovery and the Arabic script cases . . . . . . . . 9
2.3. Other examples of the same behavior . . . . . . . . . . . 7 3.2.1. New code point U+08A1, decomposition, and language
2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8 dependency . . . . . . . . . . . . . . . . . . . . . 9
3. Proposed/ Alternative Changes to RFC 5892 for new character 3.2.2. Other examples of the same behavior within the Arabic
U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Script . . . . . . . . . . . . . . . . . . . . . . . 10
3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9 3.2.3. Hamza and Combining Sequences . . . . . . . . . . . . 10
3.2. Disallow the combining sequences for these characters . . 10 3.3. Precomposed characters without decompositions more
3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11 generally . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11 3.3.1. Description of the general problem . . . . . . . . . 11
4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11 3.3.2. Latin Examples and Cases . . . . . . . . . . . . . . 12
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 3.3.3. Examples and Cases from Other Scripts . . . . . . . . 14
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 3.3.4. Scripts with precomposed preferences and ones with
7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 combining preferences . . . . . . . . . . . . . . . . 15
8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4. Confusion and the casual user . . . . . . . . . . . . . . 15
8.1. Normative References . . . . . . . . . . . . . . . . . . 13 4. Implementation options and issues: Unicode properties,
8.2. Informative References . . . . . . . . . . . . . . . . . 15 exceptions, and the nature of stability . . . . . . . . . . . 15
Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15 4.1. Unicode Stability compared to IETF (and ICANN) Stability 15
A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15 4.2. New Unicode Properties . . . . . . . . . . . . . . . . . 17
A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 15 4.3. The need for exception lists . . . . . . . . . . . . . . 18
A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 15 5. Proposed/ Alternative Changes to RFC 5892 for the issues
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 first exposed by new code point U+08A1 . . . . . . . . . . . 18
5.1. Disallow This New Code Point . . . . . . . . . . . . . . 18
5.2. Disallow This New Code Point and All Future Precomposed
Additions that do not decompose . . . . . . . . . . . . . 19
5.3. Disallow the combining sequences for these characters . . 19
5.4. Disallow all Combining Characters for Specific Scripts . 21
5.5. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 21
5.6. Normalization Form IETF (NFI) . . . . . . . . . . . . . . 21
6. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 22
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 23
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23
9. Security Considerations . . . . . . . . . . . . . . . . . . . 23
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 24
10.1. Normative References . . . . . . . . . . . . . . . . . . 24
10.2. Informative References . . . . . . . . . . . . . . . . . 27
Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 28
A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 28
A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 28
A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 29
A.4. Changes from version -03 to -04 . . . . . . . . . . . . . 29
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 29
1. Introduction 1. Introduction
Note in/about -04 Draft: This version of the document contains a
very large amount of new material as compared to the -03 version.
The new material reflects an evolution of community understanding
in the last two months from an assumption that the problem
involved only a few code points and one combining character in a
single script (Hamza Above and Arabic) to an understanding that it
is quite pervasive and may represent fundamental misunderstandings
or omissions from IDNA2008 (and, by extension, the basics of
PRECIS [PRECIS-Framework]) that must be corrected if those
protocols are going to be used in a way that supports the
predictability (as seen by the end user) and security of
internationalized Internet identifiers.
This version is still necessarily incomplete: not only is our
understanding probably still not comprehensive, but there are a
number of placeholders for text and references. Nonetheless, the
document in its current form should be useful as both the
beginning of a comprehensive overview of the issues and a source
of references to other relevant materials.
This draft could almost certainly be organized better to improve
its readability: specific suggestions would be welcome.
The current version of the IDNA specifications, known as "IDNA2008" The current version of the IDNA specifications, known as "IDNA2008"
[RFC5890], anticipated that each new version of Unicode would be [RFC5890], anticipated that each new version of Unicode would be
reviewed to verify that no changes had been introduced that required reviewed to verify that no changes had been introduced that required
adjustments to IDNA's rules and, in particular, whether new adjustments to IDNA's rules and, in particular, whether new
exceptions or backward compatibility adjustments were needed. When exceptions or backward compatibility adjustments were needed. When
that review was carefully conducted for Unicode 7.0.0 [Unicode7], that review was carefully conducted for Unicode 7.0.0 [Unicode7],
comparing it to prior versions including the text in Unicode 6.2 comparing it to prior versions including the text in Unicode 6.2
[Unicode62], it identified a problematic new code point (U+08A1, [Unicode62], it identified a problematic new code point (U+08A1,
ARABIC LETTER BEH WITH HAMZA ABOVE). The specific problem is ARABIC LETTER BEH WITH HAMZA ABOVE). The code point was added for
discussed in detail in Section 2. The behavior of that code point, use with the Fula (also known as Fulfulde, Pulaar, and Pular'Fulaare)
while non-optimal for IDNA, follows that of a few code points that language, a language that, apparently, is most often written in Latin
predate Unicode 7.x and even the IDNA 2008 specifications and Unicode characters today [Omniglot-Fula] [Dalby] [Daniels].
6.0. Those existing code points make the question of what, if
anything, to do about this new one exceedingly problematic because The specific problem is discussed in detail in Section 3. In very
broad terms, IDNA (and other IETF work) assumes that, if one can
represent "the same character" either as a combining sequence or as a
single code point, strings that are identical except for those
alternate forms will compare equal after normalization. Part of the
difficulty that has characterized this discussion is that "the same"
differs depending on the criteria that are chosen.
The behavior of the newly-added code point, while non-optimal for
IDNA, follows that of a few code points that predate Unicode 7.x and
even the IDNA 2008 specifications and Unicode 6.0. Those existing
code points, which may not be easy to accurately characterize as a
group, make the question of what, if anything, to do about this new
one and, perhaps separately, what to do about existing sets of code
points with the same behavior, exceedingly problematic because
different reasonable criteria yield different decisions, different reasonable criteria yield different decisions,
specifically: specifically:
o To disallow it as an IDNA exception case creates inconsistencies o To disallow it (and future, but not existing characters with
with how those earlier code points were handled. similar characteristics) as an IDNA exception case creates
inconsistencies with how those earlier code points were handled.
o To disallow it and the similar code points as well would o To disallow it and the similar code points as well would
necessitate invalidating some potential labels that would have necessitate invalidating some potential labels that would have
been valid under IDNA2008 until this time. However, there is been valid under IDNA2008 until this time. Depending on how the
reason to believe that no such labels exist. collection of similar code points is characterized, a few of them
are almost certainly used in reasonable labels.
o To permit the new code point to be treated as PVALID creates a o To permit the new code point to be treated as PVALID creates a
situation in which it is possible, within the same script, to situation in which it is possible, within the same script, to
compose the same character symbol (glyph) in two different ways compose the same character symbol (glyph) in two different ways
that do not compare equal even after normalization. That that do not compare equal even after normalization. That
condition would then apply to it and the earlier code points with condition would then apply to it and the earlier code points with
the same behavior. That situation contradicts a fundamental the same behavior. That situation contradicts a fundamental
assumption of IDNA that is discussed in more detail below. assumption of IDNA that is discussed in more detail below.
NOTE IN DRAFT: NOTE IN DRAFT:
This working draft discusses four alternatives, including, for This working draft discusses six alternatives, including an idea
illustration, a radical idea that seems too drastic to be (an IETF-specific normalization form) that seemed too drastic to
considered now although it would have been appropriate to discuss be considered a few months ago. However, it not only would have
when the IDNA2008 specifications were being developed. The been appropriate to discuss when the IDNA2008 specifications were
authors suggest that the community discuss the relevant tradeoffs being developed but is appearing more attractive now. The authors
and make a decision and that the document then be revised to suggest that the community discuss the relevant tradeoffs and make
reflect that decision, with the other alternatives discussed as a decision and that the document then be revised to reflect that
options not chosen. Because there is no ideal choice, the decision, with the other alternatives discussed as options not
discussion of the issues in Section 2, is probably as or more chosen. Because there is no ideal choice, the discussion of the
important than the particular choice of how to handle this code issues in Section 3, is probably as or more important than the
point. In addition to providing information for this document, particular choice of how to handle this code point. In addition
that section should be considered as an updating addendum to RFC to providing information for this document, that section should be
5894 [RFC5894] and should be incorporated into any future revision considered as an updating addendum to RFC 5894 [RFC5894] and
of that document. should be incorporated into any future revision of that document.
As the result of this version of the document containing several As the result of this version of the document containing several
alternate proposals, some of the text is also a little bit alternate proposals, some of the text is also a little bit
redundant. That will be corrected in future versions. redundant. That will be corrected in future versions.
As anticipated when IDNA2008, and RFC 5892 in particular, were As anticipated when IDNA2008, and RFC 5892 in particular, were
written, exceptions and explicit updates are likely to be needed only written, exceptions and explicit updates are likely to be needed only
if there is disagreement between the Unicode Consortium's view about if there is disagreement between the Unicode Consortium's view about
what is best for the Standard and the IETF's view of what is best for what is best for the Standard and the IETF's view of what is best for
IDNs, the DNS, and IDNA. It was hoped that a situation would never IDNs, the DNS, and IDNA. It was hoped that a situation would never
skipping to change at page 4, line 21 skipping to change at page 5, line 35
Technical Committee members suggested be disallowed to avoid a change Technical Committee members suggested be disallowed to avoid a change
in derived tables [RFC6452]. This document describes a case where in derived tables [RFC6452]. This document describes a case where
the IETF should disallow a character or characters that the various the IETF should disallow a character or characters that the various
properties would otherwise treat as PVALID. properties would otherwise treat as PVALID.
This document provides the "flagging for the IESG" specified by This document provides the "flagging for the IESG" specified by
Section 5.1 of RFC 5892. As specified there, the change itself Section 5.1 of RFC 5892. As specified there, the change itself
requires IETF review because it alters the rules of Section 2 of that requires IETF review because it alters the rules of Section 2 of that
document. document.
Readers of this document are expected to be familiar with Unicode
terminology [Unicode62] and the IETF conventions for representing
Unicode code points [RFC5137].
As a convenience to readers of RFC 5892 and to reduce the risks of
confusion, this document also formally applies the content of an
erratum to the text of the RFC (see Section 4) and so brings that RFC
up to date with all agreed changes.
[[RFC Editor: please remove the following comment and note if they [[RFC Editor: please remove the following comment and note if they
get to you.]] get to you.]]
[[IESG: It might not be a bad idea to incorporate some version of [[IESG: It might not be a bad idea to incorporate some version of
the following into the Last Call announcement.]] the following into the Last Call announcement.]]
NOTE IN DRAFT to IETF Reviewers: The issues in this document, and NOTE IN DRAFT to IETF Reviewers: The issues in this document, and
particularly the choices among options for either adding exception particularly the choices among options for either adding exception
cases to RFC 5892 or ignoring the issue, warning people, and cases to RFC 5892 or ignoring the issue, warning people, and
hoping the results do not include serious problems, are fairly hoping the results do not include serious problems, are fairly
esoteric. Understanding them requires that one have at least some esoteric. Understanding them requires that one have at least some
understanding of how the Arabic Script works and the reasons the understanding of how the Arabic Script (and perhaps other scripts
Unicode Standard gives various Arabic Script characters a fairly in which precomposed characters are preferred over combining
extended discussion [Unicode62-Arabic]. It also requires sequences as a Unicode design and extension principle) works and
understanding of a number of Unicode principles, including the the reasons the Unicode Standard gives various Arabic Script
Normalization Stability rules [UAX15-Versioning] as applied to new characters a fairly extended discussion [Unicode70-Arabic]. It
precomposed characters and guidelines for adding new characters. also requires understanding of a number of Unicode principles,
There is considerable discussion of the issues in Section 2 and including the Normalization Stability rules [UAX15-Versioning] as
references are provided for those who want to pursue them, but applied to new precomposed characters and guidelines for adding
potential reviewers should assume that the background needed to new characters. There is considerable discussion of the issues in
understand the reasons for this change is no less deep in the Section 3 and references are provided for those who want to pursue
subject matter than would be expected of someone reviewing a them, but potential reviewers should assume that the background
proposed change in, e.g., the fundamentals of BGP, TCP congestion needed to understand the reasons for this change is no less deep
control, or some cryptographic algorithm. Put more bluntly, one's in the subject matter than would be expected of someone reviewing
ability to read or speak languages other than English, or even one a proposed change in, e.g., the fundamentals of BGP, TCP
or more languages that use the Arabic script, does not make one an congestion control, or some cryptographic algorithm. Put more
expert in these matters. bluntly, one's ability to read or speak languages other than
English, or even one or more languages that use the Arabic script
or other scripts similarly affected, does not make one an expert
in these matters.
2. Problem Description This document assumes that the reader is reasonably familiar with the
terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the
IETF conventions for representing Unicode code points [RFC5137].
Some terms used here may not be used in the same way in those two
sets of documents. From one point of view, those differences may
have been the results of, or led to, misunderstandings that may, in
turn, be part of the root cause of the problems explored in this
document. In particular, this document uses the term "precomposed
character" to describe characters that could reasonably be composed
by a combining sequence using code points in the same script but for which a
single code point that does not require combining sequences is
available. That definition is strictly about mechanical composition
and does not involve any considerations about how the character is
used. It is closely related to this document's definition of
"identical". When a precomposed character exists and either applying
NFC to the combining sequence does not yield that character or
applying NFD to that character's code point does not yield the
combining sequence, it is referred to in this document as
"non-decomposable".
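The "non-decomposable" test defined above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is_non_decomposable is hypothetical and does not come from any IDNA or Unicode specification, the caller must supply both the precomposed code point and the candidate combining sequence, and the result depends on the Unicode version built into the standard unicodedata module.

```python
import unicodedata


def is_non_decomposable(precomposed: str, sequence: str) -> bool:
    """Hypothetical helper: apply this document's definition literally.

    Returns True when NFC of the combining sequence does not yield the
    precomposed character, or NFD of the precomposed character does not
    yield the combining sequence.
    """
    nfc_differs = unicodedata.normalize("NFC", sequence) != precomposed
    nfd_differs = unicodedata.normalize("NFD", precomposed) != sequence
    return nfc_differs or nfd_differs


# LATIN SMALL LETTER O WITH DIAERESIS versus "o" + COMBINING DIAERESIS
latin = is_non_decomposable("\u00F6", "\u006F\u0308")

# ARABIC LETTER BEH WITH HAMZA ABOVE versus BEH + HAMZA ABOVE
arabic = is_non_decomposable("\u08A1", "\u0628\u0654")

print(latin, arabic)
```

Under these assumptions the Latin pair is reported as decomposable and the Arabic pair as non-decomposable. The helper deliberately says nothing about how the characters are used; like the definition, it is strictly mechanical.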
2.1. IDNA assumptions about Unicode normalization 2. Document Aspirations
This document, in its present form, is not a proposal for a solution.
Instead, it is intended to be (or evolve into) a comprehensive
description of the issues and problems and to outline some possible
approaches to a solution. A perfect solution would resolve all of the
issues identified in this document, would involve a relatively small
set of relatively simple rules and hence would be comprehensible and
predictable for and by non-expert end users, would not require code
point by code point or even block by block exception lists, and would
not leave users of any script or language feeling that their
particular writing system has been treated less fairly than others.
Part of the reality we need to accept is that IDNA, in its present
form, represents compromises that do not completely satisfy those
criteria and whatever is done about these issues will probably make
it (or the job of administering zones containing IDNs) more complex.
Similarly, as the Unicode Standard suggests when it identifies ten
Design Principles and the text then says "Not all of these principles
can be satisfied simultaneously..." [Unicode70-Design], while there
are guidelines and principles, a certain amount of subjective
judgment is involved in making determinations about normalization,
decomposition, and some property values. For Unicode itself, those
issues are resolved by multiple statements (at least one cited below)
that one needs to rely on per-code point information in the Unicode
Character Database rather than on rules or principles. The design of
IDNA and the effort to keep it largely independent of Unicode
versions requires rules, categories, and principles that can be
relied upon and applied algorithmically. There is obviously some
tension between the two approaches.
3. Problem Description
3.1. IDNA assumptions about Unicode normalization
IDNA makes several assumptions about Unicode, Unicode "characters", IDNA makes several assumptions about Unicode, Unicode "characters",
and the effects of normalization. Those assumptions were based on and the effects of normalization. Those assumptions were based on
careful reading of the Unicode Standard at the time [Unicode5], careful reading of the Unicode Standard at the time [Unicode5],
guided by advice and commitments by members of the Unicode Technical guided by advice and commitments by members of the Unicode Technical
Committee. Those assumptions, and the associated requirements, are Committee. Those assumptions, and the associated requirements, are
necessitated by three properties of DNS labels that do not apply to necessitated by three properties of DNS labels that typically do not
blocks of running text: apply to blocks of running text:
1. There is no language context for a label. While particular DNS 1. There is no language context for a label. While particular DNS
zones may impose restrictions, including language or script zones may impose restrictions, including language or script
restrictions, on what labels can be registered, neither the DNS restrictions, on what labels can be registered, neither the DNS
nor IDNA impose either type of restriction or give the user of a nor IDNA impose either type of restriction or give the user of a
label any indication about the registration or other restrictions label any indication about the registration or other restrictions
that may have been imposed. that may have been imposed.
2. Labels are often mnemonics rather than words in any language. 2. Labels are often mnemonics rather than words in any language.
They may be abbreviations or acronyms or contain embedded digits They may be abbreviations or acronyms or contain embedded digits
skipping to change at page 6, line 12 skipping to change at page 8, line 24
are identical in appearance (e.g., basic Latin "a" (U+0061) and the are identical in appearance (e.g., basic Latin "a" (U+0061) and the
identical-appearing Cyrillic character (U+0430)), the most important identical-appearing Cyrillic character (U+0430)), the most important
test is that, if two glyphs are the same within a given script, they test is that, if two glyphs are the same within a given script, they
must represent the same character no matter how they are formed. must represent the same character no matter how they are formed.
Unicode normalization, as explained in [UAX15], is expected to Unicode normalization, as explained in [UAX15], is expected to
resolve those "same script, same glyph, different formation methods" resolve those "same script, same glyph, different formation methods"
issues. Within the Latin script, the code point sequence for lower issues. Within the Latin script, the code point sequence for lower
case "o" (U+006F) and combining diaeresis (U+0308) will, when case "o" (U+006F) and combining diaeresis (U+0308) will, when
normalized using the "NFC" method required by IDNA, produce the normalized using the "NFC" method required by IDNA, produce the
precombined small letter o with diaeresis (U+00F6) and hence the two precomposed small letter o with diaeresis (U+00F6) and hence the two
ways of forming the character will compare equal (and the combining ways of forming the character will compare equal (and the combining
sequence is effectively prohibited from U-labels). sequence is effectively prohibited from U-labels).
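The expectation described above can be checked with a short Python sketch. It is illustrative only: the helper name same_after_nfc is an assumption introduced here, not part of IDNA, and the comparison uses whatever Unicode version the standard unicodedata module provides.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


combining = "\u006F\u0308"   # "o" followed by COMBINING DIAERESIS
precomposed = "\u00F6"       # LATIN SMALL LETTER O WITH DIAERESIS

# Same script, same glyph, different formation methods.
within_script = same_after_nfc(combining, precomposed)

# Latin "a" and the identical-appearing Cyrillic character.
across_scripts = same_after_nfc("\u0061", "\u0430")

print(within_script, across_scripts)
```

The first comparison is the within-script case that normalization is expected to resolve. The second is the cross-script lookalike case, which normalization does not address and which the text above treats as a separate matter.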
NFC was preferred over other normalization methods for IDNA because NFC was preferred over other normalization methods for IDNA because
it is more compact, more likely to be produced on keyboards on which it is more compact, more likely to be produced on keyboards on which
the relevant characters actually appeared, and because it does not the relevant characters actually appeared, and because it does not
lose substantive information (e.g., some types of compatibility lose substantive information (e.g., some types of compatibility
equivalence involve judgment calls as to whether two characters are equivalence involve judgment calls as to whether two characters are
actually the same -- they may be "the same" in some contexts but not actually the same -- they may be "the same" in some contexts but not
others -- while canonical equivalence is about different ways to others -- while canonical equivalence is about different ways to
produce the glyph for the same abstract character). produce the glyph for the same abstract character).
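As a rough illustration of the tradeoffs just described, the following Python sketch applies three normalization forms to one sample string. The sample string is an assumption chosen for this example (it is not taken from IDNA), and LATIN SMALL LIGATURE FI is used only because it has a compatibility decomposition.

```python
import unicodedata

# o with diaeresis followed by LATIN SMALL LIGATURE FI (illustrative input)
sample = "\u00F6\uFB01"

nfc = unicodedata.normalize("NFC", sample)
nfd = unicodedata.normalize("NFD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFC keeps the precomposed form; NFD expands it to a base character
# plus a combining mark, so the NFD string is longer.
print(len(nfc), len(nfd))

# NFKC replaces the ligature with "f" and "i", so the original
# distinction can no longer be recovered from the normalized string.
print("\uFB01" in nfc, "\uFB01" in nfkc)
```

In this sketch NFC leaves the sample unchanged, NFD lengthens it, and NFKC folds the ligature away, which is the kind of information loss the paragraph above attributes to compatibility equivalence.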
IDNA also assumed that the extensive Unicode stability rules would be IDNA also assumed that the extensive Unicode stability rules would be
applied and work as specified when new code points were added. Those applied and work as specified when new code points were added. Those
rules, as described in The Unicode Standard and the normative annexes rules, as described in The Unicode Standard and the normative annexes
identified below, provide that: identified below, provide that:
1. New code points representing precombined characters that can be 1. New code points representing precomposed characters that can be
formed from combining sequences will not be added to Unicode formed from combining sequences will not be added to Unicode
unless neither the relevant base character nor required combining unless neither the relevant base character nor required combining
character are part of the Standard within the relevant script character(s) are part of the Standard within the relevant script
[UAX15-Versioning]. [UAX15-Versioning].
2. If circumstances require that principle be violated, 2. If circumstances require that principle be violated,
normalization stability requires that the newly-added character normalization stability requires that the newly-added character
decompose (even under NFC) to the previously-available combining decompose (even under NFC) to the previously-available combining
sequence [UAX15-Exclusion]. sequence [UAX15-Exclusion].
There is no explicit provision in the Standard's discussion of At least at the time IDNA2008 was being developed, there was no
conditions for adding new code points, nor of normalization explicit provision in the Standard's discussion of conditions for
stability, for an exception based on different languages using the adding new code points, nor of normalization stability, for an
same script. exception based on different languages using the same script or
ambiguities about the shape or positioning of combining characters.
2.2. New code point U+08A1, decomposition, and language dependency 3.2. The discovery and the Arabic script cases
While the set of problems with normalization discussed above was
discovered with a newly-added code point for the Arabic Script and
some characteristics of Unicode handling of that script seem to make
the problem more complex going forward, these are not issues specific
to Arabic. This section describes the Arabic-specific problems;
subsequent sections (starting with Section 3.3) discuss the problem
more generally and include illustrations from other scripts.
3.2.1. New code point U+08A1, decomposition, and language dependency
Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH
WITH HAMZA ABOVE. As can be deduced from the name, it is visually WITH HAMZA ABOVE. As can be deduced from the name, it is visually
identical to the glyph that can be formed from a combining sequence identical to the glyph that can be formed from a combining sequence
consisting of the code point for ARABIC LETTER BEH (U+0628) and the consisting of the code point for ARABIC LETTER BEH (U+0628) and the
code point for Combining Hamza Above (U+0654). The two rules code point for Combining Hamza Above (U+0654). The two rules
summarized above suggest that either the new code point should not be summarized above (see the last part of Section 3.1) suggest that
allocated at all or that it should have a decomposition to either the new code point should not be allocated at all or that it
\u'0628'\u'0654'. should have a decomposition to \u'0628'\u'0654'.
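The mismatch described above can be observed directly. The following is a minimal sketch, assuming Python's standard unicodedata module (whose Unicode database version depends on the interpreter); the helper names are illustrative and do not come from this specification.

```python
import unicodedata

# Code points taken from the text above.
BEH_HAMZA_PRECOMPOSED = "\u08A1"     # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + HAMZA ABOVE

def nfc_equal(a, b):
    """Return True if the two strings compare equal after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def has_canonical_decomposition(ch):
    """Return True if the database lists a canonical (untagged) mapping."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

if __name__ == "__main__":
    print("canonical decomposition for U+08A1:",
          has_canonical_decomposition(BEH_HAMZA_PRECOMPOSED))
    print("NFC-equal to U+0628 U+0654:",
          nfc_equal(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE))
```

Under the assumptions stated, neither check succeeds, so two labels that differ only in this choice of encoding remain distinct after the NFC step that IDNA2008 requires.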
Had the issues outlined in this document been better understood at Had the issues outlined in this document been better understood at
the time, it probably would have been wise for RFC 5892 to disallow the time, it probably would have been wise for RFC 5892 to disallow
either the precomposed character or the combining sequence of each either the precomposed character or the combining sequence of each
pair in those cases in which Unicode normalization rules do not cause pair in those cases in which Unicode normalization rules do not cause
the right thing to happen, i.e., the combining sequence and the right thing to happen, i.e., the combining sequence and
precomposed character to be treated as equivalent. Failure to do so precomposed character to be treated as equivalent. Failure to do so
at the time places an extra burden on registries to be sure that at the time places an extra burden on registries to be sure that
conflicts (and the potential for confusion and attacks) do not exist. conflicts (and the potential for confusion and attacks) do not exist.
Oddly, had the exclusion been made part of the specification at that Oddly, had the exclusion been made part of the specification at that
time, the preference for precombined forms noted above would probably time, the preference for precomposed forms noted above would probably
have dictated excluding the combining sequence, something not have dictated excluding the combining sequence, something not
otherwise done in IDNA2008 because the NFC requirement serves the otherwise done in IDNA2008 because the NFC requirement serves the
same purpose. Today, the only thing that can be excluded without the same purpose. Today, the only thing that can be excluded without the
potential disruption of disallowing a previously-PVALID combining potential disruption of disallowing a previously-PVALID combining
sequence is the newly-added code point, so whatever is sequence is the newly-added code point, so whatever is
done, or might have been contemplated with hindsight, will be done, or might have been contemplated with hindsight, will be
somewhat inconsistent. somewhat inconsistent.
2.3. Other examples of the same behavior 3.2.2. Other examples of the same behavior within the Arabic Script
One of the things that complicates the issue with the new U+08A1 code One of the things that complicates the issue with the new U+08A1 code
point is that there are several other Arabic-script code points that point is that there are several other Arabic-script code points that
behave in the same way for similar language-specific reasons. behave in the same way for similar language-specific reasons.
In particular, at least three other grapheme clusters that have been In particular, at least three other grapheme clusters that have been
present for many versions of Unicode can be seen as involving issues present for many versions of Unicode can be seen as involving issues
similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA
ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER
REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are
preferred over combining sequences using HAMZA ABOVE (U+0654) preferred over combining sequences using HAMZA ABOVE (U+0654)
[Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE [Unicode70-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE
(U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH
HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC
precomposed character and combining sequences compare equal when both LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654'
are normalized, as this specification prefers. so the precomposed character and combining sequences compare equal
when both are normalized, as this specification prefers.
There are other variations in which a precomposed character involving There are other variations in which a precomposed character involving
HAMZA ABOVE has a decomposition to a combining sequence that can form HAMZA ABOVE has a decomposition to a combining sequence that can form
it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a
compatibility (???) decomposition into the combining sequence compatibility decomposition, but not a canonical one, into the
\u'06C7'\u'0674'. combining sequence \u'06C7'\u'0674'.
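The three behaviors described in this subsection (no decomposition, canonical decomposition, compatibility-only decomposition) can be told apart mechanically. The sketch below assumes Python's standard unicodedata module; the function name and the return labels are hypothetical and chosen only for illustration.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify a code point as 'none', 'canonical', or 'compatibility'.

    unicodedata.decomposition() returns an empty string when there is no
    mapping and prefixes compatibility mappings with a tag such as
    '<compat>'.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"

# Code points named in the text above.
EXAMPLES = {
    "\u0681": "HAH WITH HAMZA ABOVE",
    "\u076C": "REH WITH HAMZA ABOVE",
    "\u0623": "ALEF WITH HAMZA ABOVE",
    "\u0624": "WAW WITH HAMZA ABOVE",
    "\u0626": "YEH WITH HAMZA ABOVE",
    "\u0677": "U WITH HAMZA ABOVE",
}

if __name__ == "__main__":
    for ch, name in EXAMPLES.items():
        print("U+%04X %-22s %s" % (ord(ch), name, decomposition_kind(ch)))
```

Only the "canonical" group compares equal to its combining sequence under NFC; the compatibility mapping for U+0677 takes effect only under NFKC or NFKD.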
2.4. Hamza and Combining Sequences 3.2.3. Hamza and Combining Sequences
As the Unicode Standard points out at some length [Unicode62-Arabic], As the Unicode Standard points out at some length [Unicode70-Arabic],
Hamza is a problematic abstract character and the "Hamza Above" Hamza is a problematic abstract character and the "Hamza Above"
construction even more so [Unicode62-Hamza]. Those sections explain construction even more so [Unicode70-Hamza]. Those sections explain
a distinction made by Unicode between the use of a Hamza mark to a distinction made by Unicode between the use of a Hamza mark to
denote a glottal stop and one used as a diacritic mark to denote a denote a glottal stop and one used as a diacritic mark to denote a
separate letter. In the first case, the combining sequence is used. separate letter. In the first case, the combining sequence is used.
In the second, a precombined character is assigned. In the second, a precomposed character is assigned.
Unlike Unicode generally and because of concerns about identifier Unlike Unicode generally and because of concerns about identifier
spoofing and attacks based on similarities, character distinctions in spoofing and attacks based on similarities, character distinctions in
IDNA are based much more strictly on the appearance of characters; IDNA are based much more strictly on the appearance of characters;
language and pronunciation distinctions within a script are not language and pronunciation distinctions within a script are not
considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite-
tautologically the same as BEH WITH HAMZA ABOVE, even if one of them tautologically the same as BEH WITH HAMZA ABOVE, even if one of them
is written as U+08A1 (new to Unicode 7.0.0) and the other as the is written as U+08A1 (new to Unicode 7.0.0) and the other as the
sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also
available in versions of Unicode going back at least to the version available in versions of Unicode going back at least to the version
[Unicode32] used in the original version of IDNA [RFC3490]). Because [Unicode32] used in the original version of IDNA [RFC3490]). Because
the precomposed form and combining sequence are, for IDNA purposes, the precomposed form and combining sequence are, for IDNA purposes,
the same, IDNA expects that normalization (specifically the the same, IDNA expects that normalization (specifically the
requirement that all U-labels be in NFC form) will cause them to requirement that all U-labels be in NFC form) will cause them to
compare equal. compare equal.
If Unicode also considered them the same, then the principle would If Unicode also considered them the same, then the principle would
apply that new precomposed ("composition") forms are not added unless apply that new precomposed ("composition") forms are not added unless
one of the code points that could be used to construct it did not one of the code points that could be used to construct it did not
exist in an earlier version (and even then is exist in an earlier version (and even then is discouraged)
discouraged)[UAX15-Versioning]. When exceptions are made, they are [UAX15-Versioning]. When exceptions are made, they are expected to
expected to conform to the rules and classes in the "Composition conform to the rules and classes in the "Composition Exclusion
Exclusion Table", with class 2 being relevant to this case Table", with class 2 being relevant to this case [UAX15-Exclusion].
[UAX15-Exclusion]. That rule essentially requires that the That rule essentially requires that the normalization for the old
normalization for the old combining sequence to itself be retained combining sequence to itself be retained (for stability) but that the
(for stability) but that the newly-added character be treated as newly-added character be treated as canonically decomposable and
canonically decomposable and decompose back to the older sequence decompose back to the older sequence even under NFC. That was not
even under NFC. That was not done for this particular case, done for this particular case, presumably because of the distinction
presumably because of the distinction about pronunciation modifiers about pronunciation modifiers versus separate letters noted above.
versus separate letters noted above. Because, for IDNA and the DNS, Because, for IDNA and the DNS, there is a possibility that the
there is a possibility that the composing sequence \u'0628'\u'0654' composing sequence \u'0628'\u'0654' already appears in labels, the
already appears in labels, the only choice other than allowing an only choice other than allowing an otherwise-identical, and
otherwise-identical, and identically-appearing, label with U+08A1 identically-appearing, label with U+08A1 substituted to identify a
substituted to identify a different DNS entry is to DISALLOW the new different DNS entry is to DISALLOW the new character.
character.
3. Proposed/ Alternative Changes to RFC 5892 for new character U+08A1 3.3. Precomposed characters without decompositions more generally
3.3.1. Description of the general problem
As mentioned above, IDNA made a strong assumption that, if there were
two ways to form the same abstract character in the same script,
normalization would result in them comparing equal. Work on IDNA2008
recognized that early versions of Unicode might also contain some
inconsistencies; see Section 3.3.2.4 below.
Having precomposed code points exist that don't have decompositions,
or having them allocated in the future, is problematic for those IDNA
assumptions about character comparison, and seems to call for either
excluding some set of code points that IDNA's rules do not now
identify, developing and using a normalization procedure that behaves
as expected (those two options may be nearly equivalent for many
purposes), or deciding to accept a risk that, apparently, will only
increase over time.
It is not clear whether the reasons the IDNABIS WG did not understand
and allow for these cases are important except insofar as they inform
considerations about what to do in the future. It seemed (and still
seems to some people) that the Unicode Standard is very clear on the
matter. In addition to the normalization stability rules cited in
the last part of Section 3.1, the discussion in the Core Standard
seems quite clear. For example, "Where characters are used in
different ways in different languages, the relevant properties are
normally defined outside the Unicode Standard" in Section 2.2,
subsection titled "Semantics" [Unicode7] did not suggest to most
readers that sometimes separate code points would be allocated within
a script based on language considerations. Similarly, the same
section of the Standard says, in a subsection titled "Unification",
"The Unicode Standard avoids duplicate encoding of characters by
unifying them within scripts across language" and does not list
exceptions to that rule or limit it to a single script although it
goes on to list "CJK" as an example. Another subsection, "Equivalent
Sequences" indicates "Common precomposed forms ... are included for
compatibility with current standards. For static precomposed forms,
the standard provides a mapping to an equivalent dynamically composed
sequence of characters". The latter appears to be precisely the "all
precomposed characters decompose into the relevant combining
sequences if the relevant base and combining characters exist in the
Standard" that IDNA needs and assumed and, again, there is no mention
of exceptions, language-dependent or otherwise. The summary of
stability policies cited in the Standard [Unicode70-Stability] does
not appear to shed any additional light on these issues.
The Standard now contains a subsection titled "Non-decomposition of
Overlaid Diacritics" [Unicod70-Overlay] that identifies a list of
diacritics that do not normally form characters that have
decompositions. The rule given has its own exceptions and the text
clearly states that there is actually no way to know whether a code
point has a decomposition other than consulting the Unicode Character
Database entry for that code point. The subsequent section notes
that this can be a security problem; while the issues with IDNA go
well beyond what is normally considered security, that comment now
seems clear. While that subsection is helpful in explaining the
problem, especially for European scripts, it does not appear in the
Unicode versions that were current when IDNA2008 was being developed.
3.3.2. Latin Examples and Cases
While this set of problems was discovered because of a code point
added to the Arabic script in precomposed form to support a
particular language, there are actually far more examples for, e.g.,
Latin script than there are for Arabic script. Many of them are
associated with the "non-decomposition of combining diacriticals"
issues mentioned above, but the next subsections describe other cases
that are not directly bound to decomposition.
3.3.2.1. The font exclusion and compatibility relationships
Unicode contains a large collection of characters that are identified
as "Mathematical Symbols". A large subset of them are basic or
decorated Latin characters, differing from the ordinary ones only by
their usage and, in appearance, by font or type styling (despite the
general principle that font distinctions are not used as the basis
for assigning separate code points). Most of these have compatibility
mappings to the base form, which eliminates them from IDNA, but
others do not and, because the same marks that are used as phonetic
diacritical markings in conventional alphabetical use have special
mathematical meanings, applications that permit the use of these
characters have their own issues with normalization and equality.
3.3.2.2. The phonetic notation characters and extensions
Another example involves various Phonetic Alphabet and Extension
characters, many of which, unlike the Mathematical ones, do not have
normalizations that would make them compare equal to the basic
characters with essentially identical representations. This would
not be a problem for IDNA if they were identified with a specialized
script or as symbols rather than letters, but neither is the case:
they are generally identified as lower case Latin Script letters even
when they are visually upper-case, another issue for IDNA.
3.3.2.3. Combining dots and other shapes combine... unless...
The discussion of "Non-decomposition of Overlaid Diacritics"
[Unicod70-Overlay] indirectly exhibits at least one reason why it has
been difficult to characterize the problem. If one combines that
subsection with others, one gets a set of rules that might be
described as:
1. If the precomposed character and the code points that make up the
combining sequence exist, then canonical composition and
decomposition work as expected, except...
2. If the precomposed character was added to Unicode after the code
points that make up the combining sequence, normalization
stability for the combining sequences requires that NFC applied
to the precomposed character decomposes rather than having the
combining sequence compose to the new character, however...
3. If the combining sequence involves a diacritic or other mark that
actually touches the base character when composed, the
precomposed character does not have a decomposition, unless...
4. The combining diacritic involved is Cedilla (U+0327), Ogonek
(U+0328), or Horn (U+031B), in which case the precomposed
characters that contain them "regularly" (but presumably not
always) decompose, and...
5. There are further exceptions for Hamza (which does not overlay
the associated base character in the same way the Latin-derived
combining diacritics and other marks do). Those decisions to
decompose a precomposed character (or not) are based on language
or phonetic considerations, not the combining mechanism or
appearance, or perhaps,...
6. Some characters have compatibility decompositions rather than
canonical ones [Unicod70-CompatDecomp]. Because compatibility
relationships are treated differently by IDNA, PRECIS
[PRECIS-Framework], and, potentially, other protocols involving
identifiers for Internet use, the existence of compatibility
relationships may or may not be helpful. Finally,...
7. There is no reason to believe the above list is complete. In
particular, if whether a precomposed character decomposes or not
is determined by language or phonetic distinctions, one may need
additional rules on a per-script and/or per-character basis.
The above list only covers the cases involving combining sequences.
It does not cover cases such as those in Section 3.3.2.1 and
Section 3.3.2.2 and there may be additional groups of cases not yet
identified.
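A few Latin-script and symbol code points can be used to illustrate rules 1 through 4 in the list above. This is a minimal sketch, assuming Python's standard unicodedata module; the sample code points and the function name are choices made for this illustration, not a list taken from the Unicode Standard or from this specification.

```python
import unicodedata

def nfc_behavior(ch):
    """Describe what NFC does with a single precomposed code point."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No canonical mapping: NFC can never equate this code point
        # with a combining sequence.
        return "no canonical decomposition"
    if unicodedata.normalize("NFC", ch) == ch:
        # The combining sequence composes to this code point.
        return "composes under NFC"
    # Canonical mapping exists, but the code point is excluded from
    # composition, so NFC yields the combining sequence.
    return "decomposes even under NFC"

SAMPLES = [
    ("\u00E9", "rule 1: e with acute"),
    ("\u2ADC", "rule 2: FORKING, a composition exclusion"),
    ("\u00F8", "rule 3: o with stroke, an overlaid mark"),
    ("\u00E7", "rule 4: c with cedilla"),
    ("\u0105", "rule 4: a with ogonek"),
    ("\u01A1", "rule 4: o with horn"),
]

if __name__ == "__main__":
    for ch, note in SAMPLES:
        print("U+%04X %-42s %s" % (ord(ch), note, nfc_behavior(ch)))
```

As the text notes, the only reliable way to learn which case applies to a given code point is to consult its Unicode Character Database entry, which is what the lookup above does.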
3.3.2.4. "Legacy" characters and new additions
The development of categories and rules for IDNA recognized that
early versions of Unicode might contain some inconsistencies if
evaluated using more contemporary rules about code point assignments
and stability. In particular, there might be some exceptions from
different practices in early versions of Unicode or anomalies caused
by copying existing single- or dual-script standards into Unicode as
block rather than individual character additions to the repertoire.
The possibility of such "legacy" exceptions was one reason why the
IDNA category rules include explicit provisions for exception lists
(even though no such code points were identified prior to 2014).
3.3.3. Examples and Cases from Other Scripts
Research into these issues has not yet turned up a comprehensive list
of affected scripts and code points. As discussed elsewhere in this
document, it is clear that Arabic and Latin Scripts are significantly
affected, that some Han and Kangxi radicals and ideographs are
affected, and that other examples do exist -- it is just not known
how many of those examples there are and what patterns, if any,
characterize them.
3.3.4. Scripts with precomposed preferences and ones with combining
preferences
While the authors have been unable to find an explanation for the
differentiation in the Unicode Standard, we have been told that there
are differences among scripts as to whether the action preference is
to add new combining sequences only (and resist adding precomposed
characters) as suggested in Section 3.3.2.3 or to add precomposed
characters, often ones that do not have decompositions. If those
differences in preference do exist, it is probably important to have
them documented so that they can be reflected in IDNA review
procedures and elsewhere. It will also require IETF discussion of
whether combining sequences should be deprecated when the
corresponding precomposed characters are added or disallowed
entirely for those scripts (as has been
implicitly suggested for Arabic language use [RFC5564]).
[[CREF1: The above isn't quite right and probably needs additional
discussion and text.]]
3.4. Confusion and the casual user
To the extent to which predictability for relatively casual users is
a desired and important feature of relevant applications or
application support protocols, it is probably worth observing that
the complex of rules and cases above is almost certainly too involved
for the typical such user to develop a good intuitive understanding
of how things behave and what relationships exist.
4. Implementation options and issues: Unicode properties, exceptions,
and the nature of stability
4.1. Unicode Stability compared to IETF (and ICANN) Stability
The various stability rules in Unicode [Unicode70-Stability] all
appear to be based on the model that once a value is assigned, it can
never be changed. That is probably appropriate for a character
coding system with multiple uses and applications. It is probably
the only option when normative relationships are expressed in tables
of values rather than by rules. One consequence of such a model is
that it is difficult or impossible to fix mistakes (for some
stability rules, the Unicode Standard does provide for exceptions)
and even harder to make adjustments that would normally be dictated
by evolution.
"No changes" provides a very strong and predictable type of stability
and there are many reasons to take that path. As in some of the
cases that motivated this document, the difficulty is that simply
adding new code points (in Unicode) or features (in a protocol or
application) may be destabilizing. One then has complete stability
for systems that never use or allow the new code points or features,
but rough edges for newer systems that see the discrepancies.
IDNA2003 (inadvertently) took that approach by freezing
on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever
been allowed, we would have had complete stability even as Unicode
libraries changed. Unicode has been quite ingenious about working
around those difficulties with such provisions as having code points
for newly-added precomposed characters decompose rather than altering
the normalization for the combining sequences. Other cases, such as
newly-added precomposed characters that do not decompose for, e.g.,
language or phonetic reasons, are more problematic.
The IETF (and ICANN and standards development bodies such as ISO and
ISO/IEC JTC1) have generally adopted a different type of stability
model, one which considers experience in use and the ill effects of
not making changes as well as the disruptive effects of doing so. In
the IETF model, if an earlier decision is causing sufficient harm and
there is consensus in the communities that are most affected that a
change is desirable enough to make transition costs acceptable, then
the change is made.
The difference and its implications are perhaps best illustrated by a
disagreement when IDNA2008 was being approved. IDNA2003 had
effectively prevented some characters, notably (measured by intensity
of the protests) the Sharp S character (U+00DF) from being used in
DNS labels by mapping them to other characters before conversion to
ACE form. It had also prohibited some other code points, notably ZWJ
(U+200D) and ZWNJ (U+200C), by discarding them. In both cases, there
were strong voices from the relevant language communities, supported
by the registry communities, that the characters were important
enough that it was more desirable to undergo the short-term pain of a
transition and some uncertainty than to continue to exclude those
characters, and the IDNA2008 rules and repertoire are consistent with
that preference. The Unicode Consortium apparently believed that
stability --elimination of any possibility of label invalidation or
different interpretations of the same string-- was more important
than those writing system requirements and community preferences.
That view was expressed through what was effectively a fork in (or
attempt to nullify) the IETF Standard [UTS46], a result that has
probably been worse for the overall Internet than either of the
possible decision choices.
4.2. New Unicode Properties
One suggestion about the way out of these problems would be to create
one or more new Unicode properties, maintained along with the rest of
Unicode, and then incorporated into new or modified rules or
categories in IDNA. Given the analysis in this document, it appears
that that property (or properties) would need to provide:
1. Identification of combining characters that, when used in
combining sequences, do not produce decomposable characters.
[[CREF2: Wording on the above is not quite right but, for the
present, maybe the intent is clear.]]
2. Identification of precomposed characters that might reasonably be
expected to decompose, but that do not.
3. Identification of character forms that are distinct only because
of language or phonetic distinctions within a script.
4. Identification of scripts for which precomposed forms are
strongly preferred and combining sequences should either be
viewed as temporary mechanisms until precomposed characters are
assigned or banned entirely.
5. Identification of code points that represent symbols for
specific, non-language, purposes even if identified as letters or
numerals by their General Property (see Section 3.3.2.2 and
Section 3.3.2.1).
Some of these properties (or characteristics or values of a single
property) would be suitable for disallowing characters, code points,
or contextual sequences that otherwise might be allowed by IDNA.
Others would be more suitable for making equality comparisons come
out as needed by IDNA, particularly to eliminate distinctions based
on language context.
While it would appear that appropriate rules and categories could be
developed for IDNA (and, presumably, for PRECIS, etc.) if the problem
areas are those identified in this document, it is not yet known
whether the list is complete (and, hence, whether additional
properties or information would be needed).
Even with such properties, IDNA would still almost certainly need
exception lists. In addition, it is likely that stability rules for
those properties would need to reflect IETF norms with arrangements
for bringing the IETF and other communities into the discussion when
tradeoffs are reviewed.
4.3. The need for exception lists
[[CREF3: Note in draft: this section is a partial placeholder and may
need more elaboration.]]
Issues with exception lists and the requirements for them are
discussed in Section 2 above and RFC 5894 [RFC5894].
5. Proposed/ Alternative Changes to RFC 5892 for the issues first
exposed by new code point U+08A1
NOTE IN DRAFT: See the comments in the Introduction, Section 1 and NOTE IN DRAFT: See the comments in the Introduction, Section 1 and
the first paragraph of each Subsection below for the status of the the first paragraph of each Subsection below for the status of the
Subsections that follow. Each one, in combination with the material Subsections that follow. Each one, in combination with the material
in Section 2 above, also provides information about the reasons why in Section 3 above, also provides information about the reasons why
that particular strategy is appropriate. that particular strategy might or might not be appropriate.
3.1. Disallow This New Code Point 5.1. Disallow This New Code Point
This option is almost certainly too Arabic-specific and does not
solve, or even address, the underlying problem. It also does not
inherently generalize to non-decomposing precomposed code points that
might be added in the future (whether to Arabic or other scripts)
even though one could add more code points to Category F in the same
way.
If chosen by the community, this subsection would update the portion If chosen by the community, this subsection would update the portion
of the IDNA2008 specification that identifies rules for what of the IDNA2008 specification that identifies rules for what
characters are permitted [RFC5892] to disallow that code point. characters are permitted [RFC5892] to disallow that code point.
With the publication of this document, Section 2.6 ("Exceptions (F)") With the publication of this document, Section 2.6 ("Exceptions (F)")
of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in
Category F so that the rule itself reads: Category F so that the rule itself reads:
F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
skipping to change at page 10, line 13 skipping to change at page 19, line 28
Section 5.3). However, that category is described as applying only Section 5.3). However, that category is described as applying only
when "property values in versions of Unicode after 5.2 have changed when "property values in versions of Unicode after 5.2 have changed
in such a way that the derived property value would no longer be in such a way that the derived property value would no longer be
PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in
Unicode 7.0.0 and no property values of code points in prior versions Unicode 7.0.0 and no property values of code points in prior versions
have changed, category G does not apply. If that section of RFC 5892 have changed, category G does not apply. If that section of RFC 5892
were to be replaced in the future, perhaps consideration should be were to be replaced in the future, perhaps consideration should be
given to adding Normalization Stability and other issues to that given to adding Normalization Stability and other issues to that
description but, at present, it is not relevant. description but, at present, it is not relevant.
3.2. Disallow the combining sequences for these characters 5.2. Disallow This New Code Point and All Future Precomposed Additions
that do not decompose
At least in principle, the approach suggested above (Section 5.1)
could be expanded to disallow all future allocations of non-
decomposing precomposed characters. This would probably require
a new Unicode property to identify such characters and/or more
emphasis on the manual, individual code point, checking of the new
Unicode version review process (i.e., not just application of the
existing rules and algorithm). It might require either a new rule in
IDNA or a modification to the structure of Category F to make
additions less tedious. It would do nothing for different ways to
form identical characters within the same script that were not
associated with decomposition and so would have to be used in
conjunction with other approaches. Finally, for scripts (such as
Arabic) where there is a very strong preference to avoid combining
sequences, this approach would exclude exactly the wrong set of
characters.
5.3. Disallow the combining sequences for these characters
As in the approach discussed in Section 5.1, this approach is too
Arabic-specific to address the more general problem. However, it
illustrates a single-script approach and a possible mechanism for
excluding combining sequences whose handling is connected to language
information (information that, as discussed above, is not relevant to
the DNS).
If chosen by the community, this subsection would update the portion If chosen by the community, this subsection would update the portion
of the IDNA2008 specification that identifies contextual rules of the IDNA2008 specification that identifies contextual rules
[RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction
with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that
the choice of this option is consistent with the general preference the choice of this option is consistent with the general preference
for precomposed characters discussed above but would ban some labels for precomposed characters discussed above but would ban some labels
that are valid today and that might, in principle, be in use. that are valid today and that might, in principle, be in use.
The required prohibition could be imposed by creating a new The required prohibition could be imposed by creating a new
contextual rule in RFC 5892 to constrain combining sequences contextual rule in RFC 5892 to constrain combining sequences
containing Hamza Above. containing Hamza Above.
As the Unicode Standard points out at some length [Unicode62-Arabic], As the Unicode Standard points out at some length [Unicode70-Arabic],
Hamza is a problematic abstract character and the "Hamza Above" Hamza is a problematic abstract character and the "Hamza Above"
construction even more so. IDNA has historically associated construction even more so. IDNA has historically associated
characters whose use is reasonable in some contexts but not others characters whose use is reasonable in some contexts but not others
with the special derived property "CONTEXTO" and then specified with the special derived property "CONTEXTO" and then specified
specific, context-dependent, rules about where they may be used. specific, context-dependent, rules about where they may be used.
Because Hamza Above is problematic (and spawns edge cases, as Because Hamza Above is problematic (and spawns edge cases, as
discussed in the Unicode Standard section cited above), it was discussed in the Unicode Standard section cited above), it was
suggested that a contextual rule might be appropriate. There are at suggested that a contextual rule might be appropriate. There are at
least two reasons why a contextual rule would not be suitable for the least two reasons why a contextual rule would not be suitable for the
present situation. present situation.
skipping to change at page 11, line 12 skipping to change at page 21, line 5
characters within that script. Neither of these cases applies to characters within that script. Neither of these cases applies to
the newly-added character even if one could imagine rules for the the newly-added character even if one could imagine rules for the
use of Hamza Above (U+0654) that would reflect the considerations use of Hamza Above (U+0654) that would reflect the considerations
of Chapter 8 of Unicode 6.2. Even had the latter been desired, of Chapter 8 of Unicode 6.2. Even had the latter been desired,
it would be somewhat late now -- Hamza Above has been present as it would be somewhat late now -- Hamza Above has been present as
a combining character (U+0654) in many versions of Unicode. a combining character (U+0654) in many versions of Unicode.
While that section of the Unicode Standard describes the issues, While that section of the Unicode Standard describes the issues,
it does not provide actionable guidance about what to do about it it does not provide actionable guidance about what to do about it
for cases going forward or when visual identity is important. for cases going forward or when visual identity is important.
3.3. Do Nothing Other Than Warn 5.4. Disallow all Combining Characters for Specific Scripts
[[CREF4: This subsection needs to be turned into prose, but the
following bullet points are probably sufficient to identify the
issues.]]
Might work for Arabic and other "precomposed preference" scripts (see
Section 3.3.4; recommended by the Arabic language community for IDNs
[RFC5564]). Hopeless for Latin. Backwards incompatible. No effect
at all on special-use representations of identical characters within
a script (see Section 3.3.2.1 and Section 3.3.2.2).
5.5. Do Nothing Other Than Warn
The recommendation from UTC is to simply warn registries, at all The recommendation from UTC is to simply warn registries, at all
levels of the tree, to be careful with this set of characters, making levels of the tree, to be careful with this set of characters, making
language distinctions within zones. Because the DNS cannot make or language distinctions within zones. Because the DNS cannot make or
enforce language distinctions, this suggestion is problematic but it enforce language distinctions, this suggestion is problematic but it
would avoid having the IETF either invalidating label strings that would avoid having the IETF either invalidating label strings that
are potentially now in use or creating inconsistencies among the are potentially now in use or creating inconsistencies among the
characters that combine with Hamza Above but that also have characters that combine with Hamza Above but that also have
precomposed forms that do not have decompositions. The potential precomposed forms that do not have decompositions. The potential
would still exist for registries to respect the warning and deprecate would still exist for registries to respect the warning and deprecate
such labels if they existed. such labels if they existed.
3.4. Normalization Form IETF (or DNS) 5.6. Normalization Form IETF (NFI)
The most radical possibility would be to decide that none of the The most radical possibility for the comparison issue would be to
Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate decide that none of the Unicode Normalization Forms specified in UAX
for use with the DNS because, contrary to their apparent 15 [UAX15] are adequate for use with the DNS because, contrary to
descriptions, normalization tables are actually determined using their apparent descriptions, normalization tables are actually
language information. However, use of language information is determined using language information. However, use of language
unacceptable for IDNA for reasons described elsewhere in this information is unacceptable for IDNA for reasons described elsewhere
document. The remedy would be to define an IETF-specific (or DNS- in this document. The remedy would be to define an IETF-specific (or
specific) normalization form, building on NFC but adhering strictly DNS-specific) normalization form (sometimes called "NFI" in
to the rule that normalization causes two different forms of the same discussions), building on NFC but adhering strictly to the rule that
character (glyph image) within the same script to be treated as normalization causes two different forms of the same character (glyph
equal. In practice such a form would be implemented for IDNA image) within the same script to be treated as equal. In practice
purposes as an additional rule within RFC 5892 (and its successors) such a form could be implemented for IDNA purposes as an additional
that constituted an exception list for the NFC tables. For this set rule within RFC 5892 (and its successors) that constituted an
of characters, the special IETF normalization form would be exception list for the NFC tables. For this set of characters, the
equivalent to the exclusion discussed in Section 3.2 above. special IETF normalization form would be equivalent to the exclusion
discussed in Section 5.3 above.
4. Editorial clarification to RFC 5892 An Internet-specific normalization form, especially if specified
somewhat separately from the IDNA core, would have a small marginal
advantage over the other strategies in this section (or in
combination with some of them), even though most of the end result
and much of the implementation would be the same in practice. While
the design of IDNA requires that strings be normalized as part of the
process of determining label validity (and hence before either
storage of values in the DNS or name resolution), there is an ongoing
debate about whether normalization should be performed before storing
a string or putting it on the wire or only when the string is
actually compared or otherwise used.
If a normalization procedure with the right properties for the IETF
was defined, that argument could be bypassed and the best decisions
made for different circumstances. The separation would also allow
better comparison of strings that lack language context in
applications environments in which the additional processing and
character classifications of IDNA and/or PRECIS were not applicable.
Having such a normalization procedure defined outside IDNA would also
minimize changes to IDNA itself, which is probably an advantage.
If the new normalization form were, in practice, simply an overlay on
NFC with modifications dictated by exception and/or property lists,
keeping its definition separate from IDNA would also avoid
interweaving those exceptions and property lists with the rules and
categories of IDNA itself, avoiding some unnecessary complexity.
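The "overlay on NFC" idea can be sketched as NFC followed by an exception list. Everything named here is hypothetical: the draft does not define the contents of such a list, nor the direction of the mapping. This sketch assumes the combining sequence is folded to the precomposed code point, so that both forms compare equal.

```python
import unicodedata

# Hypothetical exception list (assumption): sequences that NFC leaves
# untouched but that the overlay folds to a precomposed code point.
NFI_EXCEPTIONS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE -> BEH WITH HAMZA ABOVE
}


def nfi_sketch(s: str) -> str:
    """Apply NFC, then the exception overlay. Illustration only."""
    out = unicodedata.normalize("NFC", s)
    for sequence, precomposed in NFI_EXCEPTIONS.items():
        out = out.replace(sequence, precomposed)
    return out
```

Keeping the exception table separate from the NFC call mirrors the separation argued for above: the table could be maintained independently of the IDNA rules and categories.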
6. Editorial clarification to RFC 5892
Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a
clarification to Appendix A and Section A.1 of RFC 5892. This clarification to Appendix A and Section A.1 of RFC 5892. This
section of this document updates the RFC to apply that clarification. section of this document updates the RFC to apply that clarification.
1. In Appendix A, add a new paragraph after the paragraph that 1. In Appendix A, add a new paragraph after the paragraph that
begins "The code point...". The new paragraph should read: begins "The code point...". The new paragraph should read:
"For the rule to be evaluated to True for the label, it MUST be "For the rule to be evaluated to True for the label, it MUST be
evaluated separately for every occurrence of the Code point in evaluated separately for every occurrence of the Code point in
skipping to change at page 12, line 18 skipping to change at page 23, line 5
2. In Appendix A, Section A.1, replace the "Rule Set" by 2. In Appendix A, Section A.1, replace the "Rule Set" by
Rule Set: Rule Set:
False; False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
If cp .eq. \u200C And If cp .eq. \u200C And
RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp
(Joining_Type:T)*(Joining_Type:{R,D})) Then True; (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
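One reading of the rule set above, evaluated separately for every occurrence of the code point, can be sketched as follows. The function names and the tiny Joining_Type table are assumptions made for illustration; a real implementation would derive Joining_Type from the Unicode Character Database rather than from a hand-written dictionary.

```python
import re
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

# Illustrative Joining_Type values only (assumption); unlisted code points
# are treated as non-joining ("U").
JOINING_TYPE = {"\u0628": "D", "\u0627": "R", "\u064E": "T"}


def joining_type(ch: str) -> str:
    return JOINING_TYPE.get(ch, "U")


def zwnj_rule_at(label: str, i: int) -> bool:
    """Evaluate the rule set for the ZWNJ at index i of label."""
    if label[i] != ZWNJ:
        raise ValueError("index does not point at U+200C")
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # RegExpMatch over the Joining_Type of the surrounding code points
    before = "".join(joining_type(c) for c in label[:i])
    after = "".join(joining_type(c) for c in label[i + 1:])
    return bool(re.search(r"[LD]T*$", before) and re.match(r"T*[RD]", after))


def zwnj_rule(label: str) -> bool:
    """True only if the rule holds for every ZWNJ in the label."""
    return all(zwnj_rule_at(label, i)
               for i, c in enumerate(label) if c == ZWNJ)
```

Note that `zwnj_rule` returns True for a label with no ZWNJ at all, because there is then no occurrence to evaluate; whether such a label is otherwise valid is decided elsewhere.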
5. Acknowledgements 7. Acknowledgements
The Unicode 7.0.0 changes were extensively discussed within the IAB's The Unicode 7.0.0 changes were extensively discussed within the IAB's
Internationalization Program. The authors are grateful for the Internationalization Program. The authors are grateful for the
discussions and feedback there, especially from Andrew Sullivan and discussions and feedback there, especially from Andrew Sullivan and
David Thaler. Additional information was requested and received from David Thaler. Additional information was requested and received from
Mark Davis and Ken Whistler and while they probably do not agree with Mark Davis and Ken Whistler and while they probably do not agree with
the necessity of excluding this code point or taking even more the necessity of excluding this code point or taking even more
drastic action as their responsibility is to look at the Unicode drastic action as their responsibility is to look at the Unicode
Consortium requirements for stability, the decision would not have Consortium requirements for stability, the decision would not have
been possible without their input. Thanks to Bill McQuillan and Ted been possible without their input. Thanks to Bill McQuillan and Ted
Hardie for reading versions of the document carefully enough to Hardie for reading versions of the document carefully enough to
identify and report some confusing typographical errors. Several identify and report some confusing typographical errors. Several
experts and reviewers who prefer to remain anonymous also provided experts and reviewers who prefer to remain anonymous also provided
helpful input and comments on preliminary versions of this document. helpful input and comments on preliminary versions of this document.
6. IANA Considerations 8. IANA Considerations
When the IANA registry and tables are updated to reflect Unicode When the IANA registry and tables are updated to reflect Unicode
7.0.0, changes should be made according to the decisions the IETF 7.0.0, changes should be made according to the decisions the IETF
makes about Section 3. makes about Section 5.
7. Security Considerations 9. Security Considerations
[[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of From at least one point of view, this document is entirely a
this document relative to what appeared in -00. It will need to be discussion of a security issue or set of such issues. While the
rewritten once decisions are made about what path to follow. In "similar-looking characters" issue has been a concern since the
particular, if "just warn" is chosen, it will need to contain very earliest days of IDNs [HomographAttack] and has driven assorted
strong warnings.]] "character confusion" projects [ICANN-VIP], if a user types in a
string on one device and can get different results that do not
compare equal when it is typed on a different device (with both
behaving correctly and both keyboards appearing to be the same and
for the same script) then all security mechanisms that depend on the
underlying identifiers, including the practical applications of DNS
response integrity checks (DNSSEC [RFC4033]) and DNS-embedded public
key mechanisms [RFC6698], are at risk if different parties, at least
one of them malicious, obtain some of the identical-appearing and
identically-typed strings.
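The comparison failure described above can be reproduced with a standard NFC implementation. The helper name below is hypothetical; the sketch relies only on Python's `unicodedata` module.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after applying NFC to each."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+08A1 has no canonical decomposition, so NFC does not unify it with
# the sequence BEH (U+0628) + HAMZA ABOVE (U+0654).
precomposed = "\u08A1"
sequence = "\u0628\u0654"
```

Here `nfc_equal(precomposed, sequence)` is False, whereas a pair such as U+00E4 and "a" followed by U+0308, which does have a canonical decomposition, compares equal.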
Mechanisms that depend on trusting registration systems (e.g.,
registries and registrars in the DNS IDN case, see Section 5.5 above)
are likely to be of only limited utility because fully-qualified
domains that may be perfectly reasonable at the first level or two of
the DNS may have differences of this type deep in the tree, into
levels where name management is weak. Similar issues obviously apply
when names are user-selected or unmanaged.
When the issue is not a deliberate attack but simple accidental
confusion among similar strings, most of our strategies depend on the
acceptability of false negatives on matching if there is low risk of
false positives (see, for example, the discussion of false negatives
in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]).
Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and
the PRECIS effort [PRECIS-Framework]. But, because the cases covered
here are connected, not just to what the user sees but to what is
typed and where, there is an increased risk of false positives
(accidental as well as deliberate).
[[CREF5: Note in Draft: The paragraph that follows was written for a
much earlier version of this document. It is obsolete, but is being
retained as a placeholder for future developments.]]
This specification excludes a code point for which the Unicode- This specification excludes a code point for which the Unicode-
specified normalization behavior could result in two ways to form a specified normalization behavior could result in two ways to form a
visually-identical character within the same script not comparing visually-identical character within the same script not comparing
equal. That behavior could create a dream case for someone intending equal. That behavior could create a dream case for someone intending
to confuse the user by use of a domain name that looked identical to to confuse the user by use of a domain name that looked identical to
another one, was entirely in the same script, but was still another one, was entirely in the same script, but was still
considered different (see, for example, the discussion of false considered different.
negatives in identifier comparison in Section 2.1 of RFC 6943
[RFC6943]). This exclusion therefore should improve Internet
security.
8. References Internet Security in areas that involve internationalized identifiers
that might contain the relevant characters is therefore significantly
dependent on some effective resolution for the issues identified in
this document, not just hand waving, devout wishes, or appointment of
study committees about it.
8.1. Normative References 10. References
10.1. Normative References
[PRECIS-Framework]
Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
Preparation, Enforcement, and Comparison of
Internationalized Strings in Application Protocols",
February 2015, <https://datatracker.ietf.org/doc/draft-
ietf-precis-framework/>.
[RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP
137, RFC 5137, February 2008. 137, RFC 5137, February 2008.
[RFC5890] Klensin, J., "Internationalized Domain Names for [RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework", Applications (IDNA): Definitions and Document Framework",
RFC 5890, August 2010. RFC 5890, August 2010.
[RFC5892] Faltstrom, P., "The Unicode Code Points and [RFC5892] Faltstrom, P., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)", Internationalized Domain Names for Applications (IDNA)",
skipping to change at page 14, line 5 skipping to change at page 25, line 35
[UAX15-Exclusion] [UAX15-Exclusion]
"Unicode Standard Annex #15: ob. cit., Section 5", "Unicode Standard Annex #15: ob. cit., Section 5",
<http://www.unicode.org/reports/ <http://www.unicode.org/reports/
tr15/#Primary_Exclusion_List_Table>. tr15/#Primary_Exclusion_List_Table>.
[UAX15-Versioning] [UAX15-Versioning]
"Unicode Standard Annex #15, ob. cit., Section 3", "Unicode Standard Annex #15, ob. cit., Section 3",
<http://www.unicode.org/reports/tr15/#Versioning>. <http://www.unicode.org/reports/tr15/#Versioning>.
[UTS46] Davis, M. and M. Suignard, "Unicode Technical Standard
#46: Unicode IDNA Compatibility Processing", Version
7.0.0, June 2014, <http://unicode.org/reports/tr46/>.
[Unicod70-CompatDecomp]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
2.3: Compatibility Characters", Chapter 2, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.
Subsection titled "Compatibility Decomposable Characters"
starting on page 26.
[Unicod70-Overlay]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
2.2: Unicode Design Principles", Chapter 2, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.
Subsection titled "Non-decomposition of Overlaid
Diacritics" starting on page 64.
[Unicode5] [Unicode5]
The Unicode Consortium, "The Unicode Standard, Version The Unicode Consortium, "The Unicode Standard, Version
5.0", ISBN 0-321-48091-0, 2007. 5.0", ISBN 0-321-48091-0, 2007.
Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
This printed reference has now been updated online to This printed reference has now been updated online to
reflect additional code points. For code points, the reflect additional code points. For code points, the
reference at the time RFC 5890-5894 were published is to reference at the time RFC 5890-5894 were published is to
Unicode 5.2. Unicode 5.2.
[Unicode62] [Unicode62]
The Unicode Consortium, "The Unicode Standard, Version The Unicode Consortium, "The Unicode Standard, Version
6.2.0", ISBN 978-1-936213-07-8, 2012, 6.2.0", ISBN 978-1-936213-07-8, 2012,
<http://www.unicode.org/versions/Unicode6.2.0/>. <http://www.unicode.org/versions/Unicode6.2.0/>.
Preferred citation: The Unicode Consortium. The Unicode Preferred citation: The Unicode Consortium. The Unicode
Standard, Version 6.2.0, (Mountain View, CA: The Unicode Standard, Version 6.2.0, (Mountain View, CA: The Unicode
Consortium, 2012. ISBN 978-1-936213-07-8) Consortium, 2012. ISBN 978-1-936213-07-8)
[Unicode62-Arabic]
"The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8",
Chapter 8, 2012,
<http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>.
Subsection titled "Encoding Principles", paragraph
numbered 4, starting on page 251.
[Unicode62-Hamza]
"The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8",
Chapter 8, 2012,
<http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>.
Subsection titled "Combining Hamza Above" starting on page
263.
[Unicode7] [Unicode7]
The Unicode Consortium, "The Unicode Standard, Version The Unicode Consortium, "The Unicode Standard, Version
7.0.0", ISBN 978-1-936213-09-2, 2014, 7.0.0", ISBN 978-1-936213-09-2, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/>. <http://www.unicode.org/versions/Unicode7.0.0/>.
Preferred Citation: The Unicode Consortium. The Unicode Preferred Citation: The Unicode Consortium. The Unicode
Standard, Version 7.0.0, (Mountain View, CA: The Unicode Standard, Version 7.0.0, (Mountain View, CA: The Unicode
Consortium, 2014. ISBN 978-1-936213-09-2) Consortium, 2014. ISBN 978-1-936213-09-2)
8.2. Informative References [Unicode70-Arabic]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
9.2: Arabic", Chapter 9, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.
Subsection titled "Encoding Principles", paragraph
numbered 4, starting on page 362.
[Unicode70-Design]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
2.2: Unicode Design Principles", Chapter 2, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.
[Unicode70-Hamza]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
9.2: Arabic", Chapter 9, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>.
Subsection titled "Combining Hamza Above" starting on page
378.
[Unicode70-Stability]
"The Unicode Standard, Version 7.0.0, ob.cit., Chapter
2.2: Unicode Design Principles", Chapter 2, 2014,
<http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>.
Subsection titled "Stability" starting on page 23 and
containing a link to http://www.unicode.org/policies/
stability_policy.html.
10.2. Informative References
[Dalby] Dalby, A., "Dictionary of Languages: The definitive
reference to more than 400 languages", Columbia University
Press, 2004.
pages 206-207
[Daniels] Daniels, P. and W. Bright, "The World's Writing Systems",
Oxford University Press, 1986.
[HomographAttack]
Gabrilovich, E. and A. Gontmakher, "The Homograph Attack",
Communications of the ACM 45(2):128, February 2002,
<http://www.cs.technion.ac.il/~gabr/papers/
homograph_full.pdf>.
[ICANN-VIP]
ICANN, "The IDN Variant Issues Project: A Study of Issues
Related to the Management of IDN Variant TLDs (Integrated
Issues Report)", February 2012,
<https://www.icann.org/en/system/files/files/idn-vip-
integrated-issues-final-clean-20feb12-en.pdf>.
[Omniglot-Fula]
Ager, S., "Omniglot: Fula (Fulfulde, Pulaar,
Pular'Fulaare)",
<http://www.omniglot.com/writing/fula.htm>.
Captured 2015-01-07
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)", "Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003. RFC 3490, March 2003.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66, RFC
3986, January 2005.
[RFC4033] Arends, R., Austein, R., Larson, M., Massey, D., and S.
Rose, "DNS Security Introduction and Requirements", RFC
4033, March 2005.
[RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
"Linguistic Guidelines for the Use of the Arabic Language
in Internet Domains", RFC 5564, February 2010.
[RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA) - Internationalized Domain Names for Applications (IDNA) -
Unicode 6.0", RFC 6452, November 2011. Unicode 6.0", RFC 6452, November 2011.
[RFC6698] Hoffman, P. and J. Schlyter, "The DNS-Based Authentication
of Named Entities (DANE) Transport Layer Security (TLS)
Protocol: TLSA", RFC 6698, August 2012.
[Unicode32] [Unicode32]
The Unicode Consortium, "The Unicode Standard, Version The Unicode Consortium, "The Unicode Standard, Version
3.2.0", . 3.2.0", .
The Unicode Standard, Version 3.2.0 is defined by The The Unicode Standard, Version 3.2.0 is defined by The
Unicode Standard, Version 3.0 (Reading, MA, Addison- Unicode Standard, Version 3.0 (Reading, MA, Addison-
Wesley, 2000. ISBN 0-201-61633-5), as amended by the Wesley, 2000. ISBN 0-201-61633-5), as amended by the
Unicode Standard Annex #27: Unicode 3.1 Unicode Standard Annex #27: Unicode 3.1
(http://www.unicode.org/reports/tr27/) and by the Unicode (http://www.unicode.org/reports/tr27/) and by the Unicode
Standard Annex #28: Unicode 3.2 Standard Annex #28: Unicode 3.2
skipping to change at page 15, line 48 skipping to change at page 29, line 10
A.2. Changes from version -01 to -02 A.2. Changes from version -01 to -02
Corrected a typographical error in which Hamza Above was incorrectly Corrected a typographical error in which Hamza Above was incorrectly
listed with the wrong code point. listed with the wrong code point.
A.3. Changes from version -02 to -03 A.3. Changes from version -02 to -03
Corrected a typographical error in the Abstract in which RFC 5892 was Corrected a typographical error in the Abstract in which RFC 5892 was
incorrectly shown as 5982. incorrectly shown as 5982.
A.4. Changes from version -03 to -04
o Explicitly identified the applicability of U+08A1 with Fula and
added references that discuss that language and how it is written.
o Updated several Unicode 6.2 references to point to Unicode 7.0
since the latter is now available in stable form (it was done when
work on this I-D started).
o Extensively revised to discuss the non-Arabic cases, non-
decomposing diacritics, other types of characters that don't
compare equal after normalization, and the more general problem and
approaches.
Authors' Addresses Authors' Addresses
John C Klensin John C Klensin
1770 Massachusetts Ave, Ste 322 1770 Massachusetts Ave, Ste 322
Cambridge, MA 02140 Cambridge, MA 02140
USA USA
Phone: +1 617 245 1457 Phone: +1 617 245 1457
Email: john-ietf@jck.com Email: john-ietf@jck.com
Patrik Faltstrom Patrik Faltstrom
Netnod Netnod
 End of changes. 55 change blocks. 
183 lines changed or deleted 827 lines changed or added
