| < draft-klensin-idna-5892upd-unicode70-03.txt | draft-klensin-idna-5892upd-unicode70-04.txt > | |||
|---|---|---|---|---|
| Network Working Group J. Klensin | Network Working Group J. Klensin | |||
| Internet-Draft | Internet-Draft | |||
| Updates: 5892, 5894 (if approved) P. Faltstrom | Updates: 5892, 5894 (if approved) P. Faltstrom | |||
| Intended status: Standards Track Netnod | Intended status: Standards Track Netnod | |||
| Expires: July 10, 2015 January 6, 2015 | Expires: September 11, 2015 March 10, 2015 | |||
| IDNA Update for Unicode 7.0.0 | IDNA Update for Unicode 7.0.0 | |||
| draft-klensin-idna-5892upd-unicode70-03.txt | draft-klensin-idna-5892upd-unicode70-04.txt | |||
| Abstract | Abstract | |||
| The current version of the IDNA specifications anticipated that each | The current version of the IDNA specifications anticipated that each | |||
| new version of Unicode would be reviewed to verify that no changes | new version of Unicode would be reviewed to verify that no changes | |||
| had been introduced that required adjustments to the set of rules | had been introduced that required adjustments to the set of rules | |||
| and, in particular, whether new exceptions or backward compatibility | and, in particular, whether new exceptions or backward compatibility | |||
| adjustments were needed. That review was conducted for Unicode 7.0.0 | adjustments were needed. The review for Unicode 7.0.0 first | |||
| and identified a potentially problematic new code point. This | identified a potentially problematic new code point and then a much | |||
| specification discusses that code point and associated issues and | more general and difficult issue with Unicode normalization. This | |||
| updates RFC 5892 accordingly. It also applies an editorial | specification discusses those issues and proposes updates to IDNA | |||
| clarification that was the subject of an earlier erratum. In | and, potentially, the way the IETF handles comparison of identifiers | |||
| addition, the discussion of the specific issue updates RFC 5894. | more generally, especially when there is no associated language or | |||
| language identification. It also applies an editorial clarification | ||||
| to RFC 5892 that was the subject of an earlier erratum and updates | ||||
| RFC 5894 to point to the issues involved. | ||||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on July 10, 2015. | This Internet-Draft will expire on September 11, 2015. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2. Problem Description . . . . . . . . . . . . . . . . . . . . . 5 | 2. Document Aspirations . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.1. IDNA assumptions about Unicode normalization . . . . . . 5 | 3. Problem Description . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2. New code point U+08A1, decomposition, and language | 3.1. IDNA assumptions about Unicode normalization . . . . . . 7 | |||
| dependency . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.2. The discovery and the Arabic script cases . . . . . . . . 9 | |||
| 2.3. Other examples of the same behavior . . . . . . . . . . . 7 | 3.2.1. New code point U+08A1, decomposition, and language | |||
| 2.4. Hamza and Combining Sequences . . . . . . . . . . . . . . 8 | dependency . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 3. Proposed/ Alternative Changes to RFC 5892 for new character | 3.2.2. Other examples of the same behavior within the Arabic | |||
| U+08A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | Script . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
| 3.1. Disallow This New Code Point . . . . . . . . . . . . . . 9 | 3.2.3. Hamza and Combining Sequences . . . . . . . . . . . . 10 | |||
| 3.2. Disallow the combining sequences for these characters . . 10 | 3.3. Precomposed characters without decompositions more | |||
| 3.3. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 11 | generally . . . . . . . . . . . . . . . . . . . . . . . . 11 | |||
| 3.4. Normalization Form IETF (or DNS) . . . . . . . . . . . . 11 | 3.3.1. Description of the general problem . . . . . . . . . 11 | |||
| 4. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 11 | 3.3.2. Latin Examples and Cases . . . . . . . . . . . . . . 12 | |||
| 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 | 3.3.3. Examples and Cases from Other Scripts . . . . . . . . 14 | |||
| 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 | 3.3.4. Scripts with precomposed preferences and ones with | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | combining preferences . . . . . . . . . . . . . . . . 15 | |||
| 8. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 | 3.4. Confusion and the casual user . . . . . . . . . . . . . . 15 | |||
| 8.1. Normative References . . . . . . . . . . . . . . . . . . 13 | 4. Implementation options and issues: Unicode properties, | |||
| 8.2. Informative References . . . . . . . . . . . . . . . . . 15 | exceptions, and the nature of stability . . . . . . . . . . . 15 | |||
| Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 15 | 4.1. Unicode Stability compared to IETF (and ICANN) Stability 15 | |||
| A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 15 | 4.2. New Unicode Properties . . . . . . . . . . . . . . . . . 17 | |||
| A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 15 | 4.3. The need for exception lists . . . . . . . . . . . . . . 18 | |||
| A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 15 | 5. Proposed/ Alternative Changes to RFC 5892 for the issues | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 | first exposed by new code point U+08A1 . . . . . . . . . . . 18 | |||
| 5.1. Disallow This New Code Point . . . . . . . . . . . . . . 18 | ||||
| 5.2. Disallow This New Code Point and All Future Precomposed | ||||
| Additions that do not decompose . . . . . . . . . . . . . 19 | ||||
| 5.3. Disallow the combining sequences for these characters . . 19 | ||||
| 5.4. Disallow all Combining Characters for Specific Scripts . 21 | ||||
| 5.5. Do Nothing Other Than Warn . . . . . . . . . . . . . . . 21 | ||||
| 5.6. Normalization Form IETF (NFI) . . . . . . . . . . . . . 21 | ||||
| 6. Editorial clarification to RFC 5892 . . . . . . . . . . . . . 22 | ||||
| 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 23 | ||||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23 | ||||
| 9. Security Considerations . . . . . . . . . . . . . . . . . . . 23 | ||||
| 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 24 | ||||
| 10.1. Normative References . . . . . . . . . . . . . . . . . . 24 | ||||
| 10.2. Informative References . . . . . . . . . . . . . . . . . 27 | ||||
| Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 28 | ||||
| A.1. Changes from version -00 to -01 . . . . . . . . . . . . . 28 | ||||
| A.2. Changes from version -01 to -02 . . . . . . . . . . . . . 28 | ||||
| A.3. Changes from version -02 to -03 . . . . . . . . . . . . . 29 | ||||
| A.4. Changes from version -03 to -04 . . . . . . . . . . . . . 29 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 29 | ||||
| 1. Introduction | 1. Introduction | |||
| Note in/about -04 Draft: This version of the document contains a | ||||
| very large amount of new material as compared to the -03 version. | ||||
| The new material reflects an evolution of community understanding | ||||
| in the last two months from an assumption that the problem | ||||
| involved only a few code points and one combining character in a | ||||
| single script (Hamza Above and Arabic) to an understanding that it | ||||
| is quite pervasive and may represent fundamental misunderstandings | ||||
| or omissions from IDNA2008 (and, by extension, the basics of | ||||
| PRECIS [PRECIS-Framework]) that must be corrected if those | ||||
| protocols are going to be used in a way that supports the | ||||
| predictability (as seen by the end user) and security of | ||||
| internationalized Internet identifiers. | ||||
| This version is still necessarily incomplete: not only is our | ||||
| understanding probably still not comprehensive, but there are a | ||||
| number of placeholders for text and references. Nonetheless, the | ||||
| document in its current form should be useful as both the | ||||
| beginning of a comprehensive overview of the issues and a source | ||||
| of references to other relevant materials. | ||||
| This draft could almost certainly be organized better to improve | ||||
| its readability: specific suggestions would be welcome. | ||||
| The current version of the IDNA specifications, known as "IDNA2008" | The current version of the IDNA specifications, known as "IDNA2008" | |||
| [RFC5890], anticipated that each new version of Unicode would be | [RFC5890], anticipated that each new version of Unicode would be | |||
| reviewed to verify that no changes had been introduced that required | reviewed to verify that no changes had been introduced that required | |||
| adjustments to IDNA's rules and, in particular, whether new | adjustments to IDNA's rules and, in particular, whether new | |||
| exceptions or backward compatibility adjustments were needed. When | exceptions or backward compatibility adjustments were needed. When | |||
| that review was carefully conducted for Unicode 7.0.0 [Unicode7], | that review was carefully conducted for Unicode 7.0.0 [Unicode7], | |||
| comparing it to prior versions including the text in Unicode 6.2 | comparing it to prior versions including the text in Unicode 6.2 | |||
| [Unicode62], it identified a problematic new code point (U+08A1, | [Unicode62], it identified a problematic new code point (U+08A1, | |||
| ARABIC LETTER BEH WITH HAMZA ABOVE). The specific problem is | ARABIC LETTER BEH WITH HAMZA ABOVE). The code point was added for | |||
| discussed in detail in Section 2. The behavior of that code point, | use with the Fula (also known as Fulfulde, Pulaar, and Pular'Fulaare) | |||
| while non-optimal for IDNA, follows that of a few code points that | language, a language that, apparently, is most often written in Latin | |||
| predate Unicode 7.x and even the IDNA 2008 specifications and Unicode | characters today [Omniglot-Fula] [Dalby] [Daniels]. | |||
| 6.0. Those existing code points make the question of what, if | ||||
| anything, to do about this new one exceedingly problematic because | The specific problem is discussed in detail in Section 3. In very | |||
| broad terms, IDNA (and other IETF work) assume that, if one can | ||||
| represent "the same character" either as a combining sequence or as a | ||||
| single code point, strings that are identical except for those | ||||
| alternate forms will compare equal after normalization. Part of the | ||||
| difficulty that has characterized this discussion is that "the same" | ||||
| differs depending on the criteria that are chosen. | ||||
| The behavior of the newly-added code point, while non-optimal for | ||||
| IDNA, follows that of a few code points that predate Unicode 7.x and | ||||
| even the IDNA 2008 specifications and Unicode 6.0. Those existing | ||||
| code points, which may not be easy to accurately characterize as a | ||||
| group, make the question of what, if anything, to do about this new | ||||
| one and, perhaps separately, what to do about existing sets of code | ||||
| points with the same behavior, exceedingly problematic because | ||||
| different reasonable criteria yield different decisions, | different reasonable criteria yield different decisions, | |||
| specifically: | specifically: | |||
| o To disallow it as an IDNA exception case creates inconsistencies | o To disallow it (and future, but not existing characters with | |||
| with how those earlier code points were handled. | similar characteristics) as an IDNA exception case creates | |||
| inconsistencies with how those earlier code points were handled. | ||||
| o To disallow it and the similar code points as well would | o To disallow it and the similar code points as well would | |||
| necessitate invalidating some potential labels that would have | necessitate invalidating some potential labels that would have | |||
| been valid under IDNA2008 until this time. However, there is | been valid under IDNA2008 until this time. Depending on how the | |||
| reason to believe that no such labels exist. | collection of similar code points is characterized, a few of them | |||
| are almost certainly used in reasonable labels. | ||||
| o To permit the new code point to be treated as PVALID creates a | o To permit the new code point to be treated as PVALID creates a | |||
| situation in which it is possible, within the same script, to | situation in which it is possible, within the same script, to | |||
| compose the same character symbol (glyph) in two different ways | compose the same character symbol (glyph) in two different ways | |||
| that do not compare equal even after normalization. That | that do not compare equal even after normalization. That | |||
| condition would then apply to it and the earlier code points with | condition would then apply to it and the earlier code points with | |||
| the same behavior. That situation contradicts a fundamental | the same behavior. That situation contradicts a fundamental | |||
| assumption of IDNA that is discussed in more detail below. | assumption of IDNA that is discussed in more detail below. | |||
| NOTE IN DRAFT: | NOTE IN DRAFT: | |||
| This working draft discusses four alternatives, including, for | This working draft discusses six alternatives, including an idea | |||
| illustration, a radical idea that seems too drastic to be | (an IETF-specific normalization form) that seemed too drastic to | |||
| considered now although it would have been appropriate to discuss | be considered a few months ago. However, it not only would have | |||
| when the IDNA2008 specifications were being developed. The | been appropriate to discuss when the IDNA2008 specifications were | |||
| authors suggest that the community discuss the relevant tradeoffs | being developed but is appearing more attractive now. The authors | |||
| and make a decision and that the document then be revised to | suggest that the community discuss the relevant tradeoffs and make | |||
| reflect that decision, with the other alternatives discussed as | a decision and that the document then be revised to reflect that | |||
| options not chosen. Because there is no ideal choice, the | decision, with the other alternatives discussed as options not | |||
| discussion of the issues in Section 2 is probably as or more | chosen. Because there is no ideal choice, the discussion of the | |||
| important than the particular choice of how to handle this code | issues in Section 3 is probably as or more important than the | |||
| point. In addition to providing information for this document, | particular choice of how to handle this code point. In addition | |||
| that section should be considered as an updating addendum to RFC | to providing information for this document, that section should be | |||
| 5894 [RFC5894] and should be incorporated into any future revision | considered as an updating addendum to RFC 5894 [RFC5894] and | |||
| of that document. | should be incorporated into any future revision of that document. | |||
| As the result of this version of the document containing several | As the result of this version of the document containing several | |||
| alternate proposals, some of the text is also a little bit | alternate proposals, some of the text is also a little bit | |||
| redundant. That will be corrected in future versions. | redundant. That will be corrected in future versions. | |||
| As anticipated when IDNA2008, and RFC 5892 in particular, were | As anticipated when IDNA2008, and RFC 5892 in particular, were | |||
| written, exceptions and explicit updates are likely to be needed only | written, exceptions and explicit updates are likely to be needed only | |||
| if there is disagreement between the Unicode Consortium's view about | if there is disagreement between the Unicode Consortium's view about | |||
| what is best for the Standard and the IETF's view of what is best for | what is best for the Standard and the IETF's view of what is best for | |||
| IDNs, the DNS, and IDNA. It was hoped that a situation would never | IDNs, the DNS, and IDNA. It was hoped that a situation would never | |||
| skipping to change at page 4, line 21 ¶ | skipping to change at page 5, line 35 ¶ | |||
| Technical Committee members suggested be disallowed to avoid a change | Technical Committee members suggested be disallowed to avoid a change | |||
| in derived tables [RFC6452]. This document describes a case where | in derived tables [RFC6452]. This document describes a case where | |||
| the IETF should disallow a character or characters that the various | the IETF should disallow a character or characters that the various | |||
| properties would otherwise treat as PVALID. | properties would otherwise treat as PVALID. | |||
| This document provides the "flagging for the IESG" specified by | This document provides the "flagging for the IESG" specified by | |||
| Section 5.1 of RFC 5892. As specified there, the change itself | Section 5.1 of RFC 5892. As specified there, the change itself | |||
| requires IETF review because it alters the rules of Section 2 of that | requires IETF review because it alters the rules of Section 2 of that | |||
| document. | document. | |||
| Readers of this document are expected to be familiar with Unicode | ||||
| terminology [Unicode62] and the IETF conventions for representing | ||||
| Unicode code points [RFC5137]. | ||||
| As a convenience to readers of RFC 5892 and to reduce the risks of | ||||
| confusion, this document also formally applies the content of an | ||||
| erratum to the text of the RFC (see Section 4) and so brings that RFC | ||||
| up to date with all agreed changes. | ||||
| [[RFC Editor: please remove the following comment and note if they | [[RFC Editor: please remove the following comment and note if they | |||
| get to you.]] | get to you.]] | |||
| [[IESG: It might not be a bad idea to incorporate some version of | [[IESG: It might not be a bad idea to incorporate some version of | |||
| the following into the Last Call announcement.]] | the following into the Last Call announcement.]] | |||
| NOTE IN DRAFT to IETF Reviewers: The issues in this document, and | NOTE IN DRAFT to IETF Reviewers: The issues in this document, and | |||
| particularly the choices among options for either adding exception | particularly the choices among options for either adding exception | |||
| cases to RFC 5892 or ignoring the issue, warning people, and | cases to RFC 5892 or ignoring the issue, warning people, and | |||
| hoping the results do not include serious problems, are fairly | hoping the results do not include serious problems, are fairly | |||
| esoteric. Understanding them requires that one have at least some | esoteric. Understanding them requires that one have at least some | |||
| understanding of how the Arabic Script works and the reasons the | understanding of how the Arabic Script (and perhaps other scripts | |||
| Unicode Standard gives various Arabic Script characters a fairly | in which precomposed characters are preferred over combining | |||
| extended discussion [Unicode62-Arabic]. It also requires | sequences as a Unicode design and extension principle) works and | |||
| understanding of a number of Unicode principles, including the | the reasons the Unicode Standard gives various Arabic Script | |||
| Normalization Stability rules [UAX15-Versioning] as applied to new | characters a fairly extended discussion [Unicode70-Arabic]. It | |||
| precomposed characters and guidelines for adding new characters. | also requires understanding of a number of Unicode principles, | |||
| There is considerable discussion of the issues in Section 2 and | including the Normalization Stability rules [UAX15-Versioning] as | |||
| references are provided for those who want to pursue them, but | applied to new precomposed characters and guidelines for adding | |||
| potential reviewers should assume that the background needed to | new characters. There is considerable discussion of the issues in | |||
| understand the reasons for this change is no less deep in the | Section 3 and references are provided for those who want to pursue | |||
| subject matter than would be expected of someone reviewing a | them, but potential reviewers should assume that the background | |||
| proposed change in, e.g., the fundamentals of BGP, TCP congestion | needed to understand the reasons for this change is no less deep | |||
| control, or some cryptographic algorithm. Put more bluntly, one's | in the subject matter than would be expected of someone reviewing | |||
| ability to read or speak languages other than English, or even one | a proposed change in, e.g., the fundamentals of BGP, TCP | |||
| or more languages that use the Arabic script, does not make one an | congestion control, or some cryptographic algorithm. Put more | |||
| expert in these matters. | bluntly, one's ability to read or speak languages other than | |||
| English, or even one or more languages that use the Arabic script | ||||
| or other scripts similarly affected, does not make one an expert | ||||
| in these matters. | ||||
| 2. Problem Description | This document assumes that the reader is reasonably familiar with the | |||
| terminology of IDNA [RFC5890] and Unicode [Unicode7] and with the | ||||
| IETF conventions for representing Unicode code points [RFC5137]. | ||||
| Some terms used here may not be used in the same way in those two | ||||
| sets of documents. From one point of view, those differences may | ||||
| have been the results of, or led to, misunderstandings that may, in | ||||
| turn, be part of the root cause of the problems explored in this | ||||
| document. In particular, this document uses the term "precomposed | ||||
| character" to describe characters that could reasonably be composed | ||||
| by a combining sequence using code points in the same script but for which a | ||||
| single code point that does not require combining sequences is | ||||
| available. That definition is strictly about mechanical composition | ||||
| and does not involve any considerations about how the character is | ||||
| used. It is closely related to this document's definition of | ||||
| "identical". When a precomposed character exists and either applying | ||||
| NFC to the combining sequence does not yield that character or | ||||
| applying NFD to that character's code point does not yield the | ||||
| combining sequence, it is referred to in this document as | ||||
| "non-decomposable". | ||||
| 2.1. IDNA assumptions about Unicode normalization | 2. Document Aspirations | |||
| This document, in its present form, is not a proposal for a solution. | ||||
| Instead, it is intended to be (or evolve into) a comprehensive | ||||
| description of the issues and problems and to outline some possible | ||||
| approaches to a solution. A perfect solution -- one that would | ||||
| resolve all of the issues identified in this document, would involve | ||||
| a relatively small set of relatively simple rules and hence would be | ||||
| comprehensible and predictable for and by non-expert end users, would | ||||
| not require code point by code point or even block by block exception | ||||
| lists, and would not leave users of any script or language feeling | ||||
| that their particular writing system has been treated less fairly | ||||
| than others -- is probably not attainable. | ||||
| Part of the reality we need to accept is that IDNA, in its present | ||||
| form, represents compromises that do not completely satisfy those | ||||
| criteria and whatever is done about these issues will probably make | ||||
| it (or the job of administering zones containing IDNs) more complex. | ||||
| Similarly, as the Unicode Standard suggests when it identifies ten | ||||
| Design Principles and the text then says "Not all of these principles | ||||
| can be satisfied simultaneously..." [Unicode70-Design], while there | ||||
| are guidelines and principles, a certain amount of subjective | ||||
| judgment is involved in making determinations about normalization, | ||||
| decomposition, and some property values. For Unicode itself, those | ||||
| issues are resolved by multiple statements (at least one cited below) | ||||
| that one needs to rely on per-code point information in the Unicode | ||||
| Character Database rather than on rules or principles. The design of | ||||
| IDNA and the effort to keep it largely independent of Unicode | ||||
| versions requires rules, categories, and principles that can be | ||||
| relied upon and applied algorithmically. There is obviously some | ||||
| tension between the two approaches. | ||||
| 3. Problem Description | ||||
| 3.1. IDNA assumptions about Unicode normalization | ||||
| IDNA makes several assumptions about Unicode, Unicode "characters", | IDNA makes several assumptions about Unicode, Unicode "characters", | |||
| and the effects of normalization. Those assumptions were based on | and the effects of normalization. Those assumptions were based on | |||
| careful reading of the Unicode Standard at the time [Unicode5], | careful reading of the Unicode Standard at the time [Unicode5], | |||
| guided by advice and commitments by members of the Unicode Technical | guided by advice and commitments by members of the Unicode Technical | |||
| Committee. Those assumptions, and the associated requirements, are | Committee. Those assumptions, and the associated requirements, are | |||
| necessitated by three properties of DNS labels that do not apply to | necessitated by three properties of DNS labels that typically do not | |||
| blocks of running text: | apply to blocks of running text: | |||
| 1. There is no language context for a label. While particular DNS | 1. There is no language context for a label. While particular DNS | |||
| zones may impose restrictions, including language or script | zones may impose restrictions, including language or script | |||
| restrictions, on what labels can be registered, neither the DNS | restrictions, on what labels can be registered, neither the DNS | |||
| nor IDNA imposes either type of restriction or gives the user of a | nor IDNA imposes either type of restriction or gives the user of a | |||
| label any indication about the registration or other restrictions | label any indication about the registration or other restrictions | |||
| that may have been imposed. | that may have been imposed. | |||
| 2. Labels are often mnemonics rather than words in any language. | 2. Labels are often mnemonics rather than words in any language. | |||
| They may be abbreviations or acronyms or contain embedded digits | They may be abbreviations or acronyms or contain embedded digits | |||
| skipping to change at page 6, line 12 ¶ | skipping to change at page 8, line 24 ¶ | |||
| are identical in appearance (e.g., basic Latin "a" (U+0061) and the | are identical in appearance (e.g., basic Latin "a" (U+0061) and the | |||
| identical-appearing Cyrillic character (U+0430)), the most important | identical-appearing Cyrillic character (U+0430)), the most important | |||
| test is that, if two glyphs are the same within a given script, they | test is that, if two glyphs are the same within a given script, they | |||
| must represent the same character no matter how they are formed. | must represent the same character no matter how they are formed. | |||
| Unicode normalization, as explained in [UAX15], is expected to | Unicode normalization, as explained in [UAX15], is expected to | |||
| resolve those "same script, same glyph, different formation methods" | resolve those "same script, same glyph, different formation methods" | |||
| issues. Within the Latin script, the code point sequence for lower | issues. Within the Latin script, the code point sequence for lower | |||
| case "o" (U+006F) and combining diaeresis (U+0308) will, when | case "o" (U+006F) and combining diaeresis (U+0308) will, when | |||
| normalized using the "NFC" method required by IDNA, produce the | normalized using the "NFC" method required by IDNA, produce the | |||
| precombined small letter o with diaeresis (U+00F6) and hence the two | precomposed small letter o with diaeresis (U+00F6) and hence the two | |||
| ways of forming the character will compare equal (and the combining | ways of forming the character will compare equal (and the combining | |||
| sequence is effectively prohibited from U-labels). | sequence is effectively prohibited from U-labels). | |||
| NFC was preferred over other normalization methods for IDNA because | NFC was preferred over other normalization methods for IDNA because | |||
| it is more compact, more likely to be produced on keyboards on which | it is more compact, more likely to be produced on keyboards on which | |||
| the relevant characters actually appeared, and because it does not | the relevant characters actually appeared, and because it does not | |||
| lose substantive information (e.g., some types of compatibility | lose substantive information (e.g., some types of compatibility | |||
| equivalence involve judgment calls as to whether two characters are | equivalence involve judgment calls as to whether two characters are | |||
| actually the same -- they may be "the same" in some contexts but not | actually the same -- they may be "the same" in some contexts but not | |||
| others -- while canonical equivalence is about different ways to | others -- while canonical equivalence is about different ways to | |||
| produce the glyph for the same abstract character). | produce the glyph for the same abstract character). | |||
| IDNA also assumed that the extensive Unicode stability rules would be | IDNA also assumed that the extensive Unicode stability rules would be | |||
| applied and work as specified when new code points were added. Those | applied and work as specified when new code points were added. Those | |||
| rules, as described in The Unicode Standard and the normative annexes | rules, as described in The Unicode Standard and the normative annexes | |||
| identified below, provide that: | identified below, provide that: | |||
| 1. New code points representing precombined characters that can be | 1. New code points representing precomposed characters that can be | |||
| formed from combining sequences will not be added to Unicode | formed from combining sequences will not be added to Unicode | |||
| unless neither the relevant base character nor required combining | unless neither the relevant base character nor required combining | |||
| character are part of the Standard within the relevant script | character(s) are part of the Standard within the relevant script | |||
| [UAX15-Versioning]. | [UAX15-Versioning]. | |||
| 2. If circumstances require that principle be violated, | 2. If circumstances require that principle be violated, | |||
| normalization stability requires that the newly-added character | normalization stability requires that the newly-added character | |||
| decompose (even under NFC) to the previously-available combining | decompose (even under NFC) to the previously-available combining | |||
| sequence [UAX15-Exclusion]. | sequence [UAX15-Exclusion]. | |||
| There is no explicit provision in the Standard's discussion of | At least at the time IDNA2008 was being developed, there was no | |||
| conditions for adding new code points, nor of normalization | explicit provision in the Standard's discussion of conditions for | |||
| stability, for an exception based on different languages using the | adding new code points, nor of normalization stability, for an | |||
| same script. | exception based on different languages using the same script or | |||
| ambiguities about the shape or positioning of combining characters. | ||||
| 2.2. New code point U+08A1, decomposition, and language dependency | 3.2. The discovery and the Arabic script cases | |||
| While the set of problems with normalization discussed above was | ||||
| discovered with a newly-added code point for the Arabic Script and | ||||
| some characteristics of Unicode handling of that script seem to make | ||||
| the problem more complex going forward, these are not issues specific | ||||
| to Arabic. This section describes the Arabic-specific problems; | ||||
| subsequent ones (starting with Section 3.3) discuss the problem more | ||||
| generally and include illustrations from other scripts. | ||||
| 3.2.1. New code point U+08A1, decomposition, and language dependency | ||||
| Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH | Unicode 7.0.0 introduces the new code point U+08A1, ARABIC LETTER BEH | |||
| WITH HAMZA ABOVE. As can be deduced from the name, it is visually | WITH HAMZA ABOVE. As can be deduced from the name, it is visually | |||
| identical to the glyph that can be formed from a combining sequence | identical to the glyph that can be formed from a combining sequence | |||
| consisting of the code point for ARABIC LETTER BEH (U+0628) and the | consisting of the code point for ARABIC LETTER BEH (U+0628) and the | |||
| code point for Combining Hamza Above (U+0654). The two rules | code point for Combining Hamza Above (U+0654). The two rules | |||
| summarized above suggest that either the new code point should not be | summarized above (see the last part of Section 3.1) suggest that | |||
| allocated at all or that it should have a decomposition to | either the new code point should not be allocated at all or that it | |||
| \u'0628'\u'0654'. | should have a decomposition to \u'0628'\u'0654'. | |||
| Had the issues outlined in this document been better understood at | Had the issues outlined in this document been better understood at | |||
| the time, it probably would have been wise for RFC 5892 to disallow | the time, it probably would have been wise for RFC 5892 to disallow | |||
| either the precomposed character or the combining sequence of each | either the precomposed character or the combining sequence of each | |||
| pair in those cases in which Unicode normalization rules do not cause | pair in those cases in which Unicode normalization rules do not cause | |||
| the right thing to happen, i.e., the combining sequence and | the right thing to happen, i.e., the combining sequence and | |||
| precomposed character to be treated as equivalent. Failure to do so | precomposed character to be treated as equivalent. Failure to do so | |||
| at the time places an extra burden on registries to be sure that | at the time places an extra burden on registries to be sure that | |||
| conflicts (and the potential for confusion and attacks) do not exist. | conflicts (and the potential for confusion and attacks) do not exist. | |||
| Oddly, had the exclusion been made part of the specification at that | Oddly, had the exclusion been made part of the specification at that | |||
| time, the preference for precombined forms noted above would probably | time, the preference for precomposed forms noted above would probably | |||
| have dictated excluding the combining sequence, something not | have dictated excluding the combining sequence, something not | |||
| otherwise done in IDNA2008 because the NFC requirement serves the | otherwise done in IDNA2008 because the NFC requirement serves the | |||
| same purpose. Today, the only thing that can be excluded without the | same purpose. Today, the only thing that can be excluded without the | |||
| potential disruption of disallowing a previously-PVALID combining | potential disruption of disallowing a previously-PVALID combining | |||
| sequence is the newly-added code point, so whatever is | sequence is the newly-added code point, so whatever is | |||
| done, or might have been contemplated with hindsight, will be | done, or might have been contemplated with hindsight, will be | |||
| somewhat inconsistent. | somewhat inconsistent. | |||
| 2.3. Other examples of the same behavior | 3.2.2. Other examples of the same behavior within the Arabic Script | |||
| One of the things that complicates the issue with the new U+08A1 code | One of the things that complicates the issue with the new U+08A1 code | |||
| point is that there are several other Arabic-script code points that | point is that there are several other Arabic-script code points that | |||
| behave in the same way for similar language-specific reasons. | behave in the same way for similar language-specific reasons. | |||
| In particular, at least three other grapheme clusters that have been | In particular, at least three other grapheme clusters that have been | |||
| present for many versions of Unicode can be seen as involving issues | present for many versions of Unicode can be seen as involving issues | |||
| similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA | similar to those for the newly-added ARABIC LETTER BEH WITH HAMZA | |||
| ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER | ABOVE. ARABIC LETTER HAH WITH HAMZA ABOVE (U+0681) and ARABIC LETTER | |||
| REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are | REH WITH HAMZA ABOVE (U+076C) do not have decomposition forms and are | |||
| preferred over combining sequences using HAMZA ABOVE (U+0654) | preferred over combining sequences using HAMZA ABOVE (U+0654) | |||
| [Unicode62-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE | [Unicode70-Hamza]. By contrast, ARABIC LETTER ALEF WITH HAMZA ABOVE | |||
| (U+0623) decomposes into \u'0627'\u'0654' and ARABIC LETTER YEH WITH | (U+0623) decomposes into \u'0627'\u'0654', ARABIC LETTER WAW WITH | |||
| HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' so the | HAMZA ABOVE (U+0624) decomposes into \u'0648'\u'0654', and ARABIC | |||
| precomposed character and combining sequences compare equal when both | LETTER YEH WITH HAMZA ABOVE (U+0626) decomposes into \u'064A'\u'0654' | |||
| are normalized, as this specification prefers. | so the precomposed character and combining sequences compare equal | |||
| when both are normalized, as this specification prefers. | ||||
| There are other variations in which a precomposed character involving | There are other variations in which a precomposed character involving | |||
| HAMZA ABOVE has a decomposition to a combining sequence that can form | HAMZA ABOVE has a decomposition to a combining sequence that can form | |||
| it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a | it. For example, ARABIC LETTER U WITH HAMZA ABOVE (U+0677) has a | |||
| compatibility (???) decomposition into the combining sequence | compatibility decomposition, but not a canonical one, into the | |||
| \u'06C7'\u'0674'. | combining sequence \u'06C7'\u'0674'. | |||
| 2.4. Hamza and Combining Sequences | 3.2.3. Hamza and Combining Sequences | |||
| As the Unicode Standard points out at some length [Unicode62-Arabic], | As the Unicode Standard points out at some length [Unicode70-Arabic], | |||
| Hamza is a problematic abstract character and the "Hamza Above" | Hamza is a problematic abstract character and the "Hamza Above" | |||
| construction even more so [Unicode62-Hamza]. Those sections explain | construction even more so [Unicode70-Hamza]. Those sections explain | |||
| a distinction made by Unicode between the use of a Hamza mark to | a distinction made by Unicode between the use of a Hamza mark to | |||
| denote a glottal stop and one used as a diacritic mark to denote a | denote a glottal stop and one used as a diacritic mark to denote a | |||
| separate letter. In the first case, the combining sequence is used. | separate letter. In the first case, the combining sequence is used. | |||
| In the second, a precombined character is assigned. | In the second, a precomposed character is assigned. | |||
| Unlike Unicode generally and because of concerns about identifier | Unlike Unicode generally and because of concerns about identifier | |||
| spoofing and attacks based on similarities, character distinctions in | spoofing and attacks based on similarities, character distinctions in | |||
| IDNA are based much more strictly on the appearance of characters; | IDNA are based much more strictly on the appearance of characters; | |||
| language and pronunciation distinctions within a script are not | language and pronunciation distinctions within a script are not | |||
| considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- | considered. So, for IDNA, BEH WITH HAMZA ABOVE is not-quite- | |||
| tautologically the same as BEH WITH HAMZA ABOVE, even if one of them | tautologically the same as BEH WITH HAMZA ABOVE, even if one of them | |||
| is written as U+08A1 (new to Unicode 7.0.0) and the other as the | is written as U+08A1 (new to Unicode 7.0.0) and the other as the | |||
| sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also | sequence \u'0628'\u'0654' (feasible with Unicode 7.0.0 but also | |||
| available in versions of Unicode going back at least to the version | available in versions of Unicode going back at least to the version | |||
| [Unicode32] used in the original version of IDNA [RFC3490]). Because | [Unicode32] used in the original version of IDNA [RFC3490]). Because | |||
| the precomposed form and combining sequence are, for IDNA purposes, | the precomposed form and combining sequence are, for IDNA purposes, | |||
| the same, IDNA expects that normalization (specifically the | the same, IDNA expects that normalization (specifically the | |||
| requirement that all U-labels be in NFC form) will cause them to | requirement that all U-labels be in NFC form) will cause them to | |||
| compare equal. | compare equal. | |||
| If Unicode also considered them the same, then the principle would | If Unicode also considered them the same, then the principle would | |||
| apply that new precomposed ("composition") forms are not added unless | apply that new precomposed ("composition") forms are not added unless | |||
| one of the code points that could be used to construct it did not | one of the code points that could be used to construct it did not | |||
| exist in an earlier version (and even then is | exist in an earlier version (and even then is discouraged) | |||
| discouraged)[UAX15-Versioning]. When exceptions are made, they are | [UAX15-Versioning]. When exceptions are made, they are expected to | |||
| expected to conform to the rules and classes in the "Composition | conform to the rules and classes in the "Composition Exclusion | |||
| Exclusion Table", with class 2 being relevant to this case | Table", with class 2 being relevant to this case [UAX15-Exclusion]. | |||
| [UAX15-Exclusion]. That rule essentially requires that the | That rule essentially requires that the normalization for the old | |||
| normalization for the old combining sequence to itself be retained | combining sequence to itself be retained (for stability) but that the | |||
| (for stability) but that the newly-added character be treated as | newly-added character be treated as canonically decomposable and | |||
| canonically decomposable and decompose back to the older sequence | decompose back to the older sequence even under NFC. That was not | |||
| even under NFC. That was not done for this particular case, | done for this particular case, presumably because of the distinction | |||
| presumably because of the distinction about pronunciation modifiers | about pronunciation modifiers versus separate letters noted above. | |||
| versus separate letters noted above. Because, for IDNA and the DNS, | Because, for IDNA and the DNS, there is a possibility that the | |||
| there is a possibility that the composing sequence \u'0628'\u'0654' | composing sequence \u'0628'\u'0654' already appears in labels, the | |||
| already appears in labels, the only choice other than allowing an | only choice other than allowing an otherwise-identical, and | |||
| otherwise-identical, and identically-appearing, label with U+08A1 | identically-appearing, label with U+08A1 substituted to identify a | |||
| substituted to identify a different DNS entry is to DISALLOW the new | different DNS entry is to DISALLOW the new character. | |||
| character. | ||||
| 3. Proposed/ Alternative Changes to RFC 5892 for new character U+08A1 | 3.3. Precomposed characters without decompositions more generally | |||
| 3.3.1. Description of the general problem | ||||
| As mentioned above, IDNA made a strong assumption that, if there were | ||||
| two ways to form the same abstract character in the same script, | ||||
| normalization would result in them comparing equal. Work on IDNA2008 | ||||
| recognized that early versions of Unicode might also contain some | ||||
| inconsistencies; see Section 3.3.2.4 below. | ||||
| Having precomposed code points exist that don't have decompositions, | ||||
| or having them allocated in the future, is problematic for those IDNA | ||||
| assumptions about character comparison, and seems to call for either | ||||
| excluding some set of code points that IDNA's rules do not now | ||||
| identify, developing and using a normalization procedure that behaves | ||||
| as expected (those two options may be nearly equivalent for many | ||||
| purposes), or deciding to accept a risk that, apparently, will only | ||||
| increase over time. | ||||
| It is not clear whether the reasons the IDNABIS WG did not understand | ||||
| and allow for these cases are important except insofar as they inform | ||||
| considerations about what to do in the future. It seemed (and still | ||||
| seems to some people) that the Unicode Standard is very clear on the | ||||
| matter. In addition to the normalization stability rules cited in | ||||
| the last part of Section 3.1, the discussion in the Core Standard | ||||
| seems quite clear. For example, "Where characters are used in | ||||
| different ways in different languages, the relevant properties are | ||||
| normally defined outside the Unicode Standard" in Section 2.2, | ||||
| subsection titled "Semantics" [Unicode7] did not suggest to most | ||||
| readers that sometimes separate code points would be allocated within | ||||
| a script based on language considerations. Similarly, the same | ||||
| section of the Standard says, in a subsection titled "Unification", | ||||
| "The Unicode Standard avoids duplicate encoding of characters by | ||||
| unifying them within scripts across language" and does not list | ||||
| exceptions to that rule or limit it to a single script although it | ||||
| goes on to list "CJK" as an example. Another subsection, "Equivalent | ||||
| Sequences" indicates "Common precomposed forms ... are included for | ||||
| compatibility with current standards. For static precomposed forms, | ||||
| the standard provides a mapping to an equivalent dynamically composed | ||||
| sequence of characters". The latter appears to be precisely the "all | ||||
| precomposed characters decompose into the relevant combining | ||||
| sequences if the relevant base and combining characters exist in the | ||||
| Standard" that IDNA needs and assumed and, again, there is no mention | ||||
| of exceptions, language-dependent or otherwise. The summary of | ||||
| stability policies cited in the Standard [Unicode70-Stability] does | ||||
| not appear to shed any additional light on these issues. | ||||
| The Standard now contains a subsection titled "Non-decomposition of | ||||
| Overlaid Diacritics" [Unicod70-Overlay] that identifies a list of | ||||
| diacritics that do not normally form characters that have | ||||
| decompositions. The rule given has its own exceptions and the text | ||||
| clearly states that there is actually no way to know whether a code | ||||
| point has a decomposition other than consulting the Unicode Character | ||||
| Database entry for that code point. The subsequent section notes | ||||
| that this can be a security problem; while the issues with IDNA go | ||||
| well beyond what is normally considered security, that comment now | ||||
| seems clear. While that subsection is helpful in explaining the | ||||
| problem, especially for European scripts, it does not appear in the | ||||
| Unicode versions that were current when IDNA2008 was being developed. | ||||
| 3.3.2. Latin Examples and Cases | ||||
| While this set of problems was discovered because of a code point | ||||
| added to the Arabic script in precombined form to support a | ||||
| particular language, there are actually far more examples for, e.g., | ||||
| Latin script than there are for Arabic script. Many of them are | ||||
| associated with the "non-decomposition of combining diacriticals" | ||||
| issues mentioned above, but the next subsections describe other cases | ||||
| that are not directly bound to decomposition. | ||||
| 3.3.2.1. The font exclusion and compatibility relationships | ||||
| Unicode contains a large collection of characters that are identified | ||||
| as "Mathematical Symbols". A large subset of them are basic or | ||||
| decorated Latin characters, differing from the ordinary ones only by | ||||
| their usage and, in appearance, by font or type styling (despite the | ||||
| general principle that font distinctions are not used as the basis | ||||
| for assigning separate code points). Most of these have canonical | ||||
| mappings to the base form, which eliminates them from IDNA, but | ||||
| others do not and, because the same marks that are used as phonetic | ||||
| diacritical markings in conventional alphabetical use have special | ||||
| mathematical meanings, applications that permit the use of these | ||||
| characters have their own issues with normalization and equality. | ||||
| 3.3.2.2. The phonetic notation characters and extensions | ||||
| Another example involves various Phonetic Alphabet and Extension | ||||
| characters, many of which, unlike the Mathematical ones, do not have | ||||
| normalizations that would make them compare equal to the basic | ||||
| characters with essentially identical representations. This would | ||||
| not be a problem for IDNA if they were identified with a specialized | ||||
| script or as symbols rather than letters, but neither is the case: | ||||
| they are generally identified as lower case Latin Script letters even | ||||
| when they are visually upper-case, another issue for IDNA. | ||||
| 3.3.2.3. Combining dots and other shapes combine... unless... | ||||
| The discussion of "Non-decomposition of Overlaid Diacritics" | ||||
| [Unicod70-Overlay] indirectly exhibits at least one reason why it has | ||||
| been difficult to characterize the problem. If one combines that | ||||
| subsection with others, one gets a set of rules that might be | ||||
| described as: | ||||
| 1. If the precomposed character and the code points that make up the | ||||
| combining sequence exist, then canonical composition and | ||||
| decomposition work as expected, except... | ||||
| 2. If the precomposed character was added to Unicode after the code | ||||
| points that make up the combining sequence, normalization | ||||
| stability for the combining sequences requires that NFC applied | ||||
| to the precomposed character decomposes rather than having the | ||||
| combining sequence compose to the new character, however... | ||||
| 3. If the combining sequence involves a diacritic or other mark that | ||||
| actually touches the base character when composed, the | ||||
| precomposed character does not have a decomposition, unless... | ||||
| 4. The combining diacritic involved is Cedilla (U+0327), Ogonek | ||||
| (U+0328), or Horn (U+031B), in which case the precomposed | ||||
| characters that contain them "regularly" (but presumably not | ||||
| always) decompose, and... | ||||
| 5. There are further exceptions for Hamza (which does not overlay | ||||
| the associated base character in the same way the Latin-derived | ||||
| combining diacritics and other marks do). Those decisions to | ||||
| decompose a precomposed character (or not) are based on language | ||||
| or phonetic considerations, not the combining mechanism or | ||||
| appearance, or perhaps,... | ||||
| 6. Some characters have compatibility decompositions rather than | ||||
| canonical ones [Unicod70-CompatDecomp]. Because compatibility | ||||
| relationships are treated differently by IDNA, PRECIS | ||||
| [PRECIS-Framework], and, potentially, other protocols involving | ||||
| identifiers for Internet use, the existence of a compatibility | ||||
| relationship may or may not be helpful. Finally,... | ||||
| 7. There is no reason to believe the above list is complete. In | ||||
| particular, if whether a precomposed character decomposes or not | ||||
| is determined by language or phonetic distinctions, one may need | ||||
| additional rules on a per-script and/or per-character basis. | ||||
| The above list only covers the cases involving combining sequences. | ||||
| It does not cover cases such as those in Section 3.3.2.1 and | ||||
| Section 3.3.2.2 and there may be additional groups of cases not yet | ||||
| identified. | ||||
| 3.3.2.4. "Legacy" characters and new additions | ||||
| The development of categories and rules for IDNA recognized that | ||||
| early versions of Unicode might contain some inconsistencies if | ||||
| evaluated using more contemporary rules about code point assignments | ||||
| and stability. In particular, there might be some exceptions from | ||||
| different practices in early versions of Unicode or anomalies caused | ||||
| by copying existing single- or dual-script standards into Unicode as | ||||
| blocks rather than individual character additions to the repertoire. | ||||
| The possibility of such "legacy" exceptions was one reason why the | ||||
| IDNA category rules include explicit provisions for exception lists | ||||
| (even though no such code points were identified prior to 2014). | ||||
| 3.3.3. Examples and Cases from Other Scripts | ||||
| Research into these issues has not yet turned up a comprehensive list | ||||
| of affected scripts and code points. As discussed elsewhere in this | ||||
| document, it is clear that Arabic and Latin Scripts are significantly | ||||
| affected, that some Han and Kangxi radicals and ideographs are | ||||
| affected, and that other examples do exist -- it is just not known | ||||
| how many of those examples there are and what patterns, if any, | ||||
| characterize them. | ||||
| 3.3.4. Scripts with precomposed preferences and ones with combining | ||||
| preferences | ||||
| While the authors have been unable to find an explanation for the | ||||
| differentiation in the Unicode Standard, we have been told that there | ||||
| are differences among scripts as to whether the action preference is | ||||
| to add new combining sequences only (and resist adding precomposed | ||||
| characters) as suggested in Section 3.3.2.3 or to add precomposed | ||||
| characters, often ones that do not have decompositions. If those | ||||
| differences in preference do exist, it is probably important to have | ||||
| them documented so that they can be reflected in IDNA review | ||||
| procedures and elsewhere. It will also require IETF discussion of | ||||
| whether combining sequences should be deprecated when the | ||||
| corresponding precomposed characters are added or whether combining | ||||
| sequences should be disallowed entirely for those scripts (as has been | ||||
| implicitly suggested for Arabic language use [RFC5564]). | ||||
| [[CREF1: The above isn't quite right and probably needs additional | ||||
| discussion and text.]] | ||||
| 3.4. Confusion and the casual user | ||||
| To the extent to which predictability for relatively casual users is | ||||
| a desired and important feature of relevant application or | ||||
| application support protocols, it is probably worth observing that | ||||
| the complex of rules and cases above is almost certainly too involved | ||||
| for the typical such user to develop a good intuitive understanding | ||||
| of how things behave and what relationships exist. | ||||
| 4. Implementation options and issues: Unicode properties, exceptions, | ||||
| and the nature of stability | ||||
| 4.1. Unicode Stability compared to IETF (and ICANN) Stability | ||||
| The various stability rules in Unicode [Unicode70-Stability] all | ||||
| appear to be based on the model that once a value is assigned, it can | ||||
| never be changed. That is probably appropriate for a character | ||||
| coding system with multiple uses and applications. It is probably | ||||
| the only option when normative relationships are expressed in tables | ||||
| of values rather than by rules. One consequence of such a model is | ||||
| that it is difficult or impossible to fix mistakes (for some | ||||
| stability rules, the Unicode Standard does provide for exceptions) | ||||
| and even harder to make adjustments that would normally be dictated | ||||
| by evolution. | ||||
| "No changes" provides a very strong and predictable type of stability | ||||
| and there are many reasons to take that path. As in some of the | ||||
| cases that motivated this document, the difficulty is that simply | ||||
| adding new code points (in Unicode) or features (in a protocol or | ||||
| application) may be destabilizing. One then has complete stability | ||||
| for systems that never use or allow the new code points or features, | ||||
| but rough edges for newer systems that encounter the | ||||
| discrepancies. IDNA2003 (inadvertently) took that approach by freezing | ||||
| on Unicode 3.2 -- if no code points added after Unicode 3.2 had ever | ||||
| been allowed, we would have had complete stability even as Unicode | ||||
| libraries changed. Unicode has been quite ingenious about working | ||||
| around those difficulties with such provisions as having code points | ||||
| for newly-added precomposed characters decompose rather than altering | ||||
| the normalization for the combining sequences. Other cases, such as | ||||
| newly-added precomposed characters that do not decompose for, e.g., | ||||
| language or phonetic reasons, are more problematic. | ||||
| The IETF (and ICANN and standards development bodies such as ISO and | ||||
| ISO/IEC JTC1) have generally adopted a different type of stability | ||||
| model, one which considers experience in use and the ill effects of | ||||
| not making changes as well as the disruptive effects of doing so. In | ||||
| the IETF model, if an earlier decision is causing sufficient harm and | ||||
| there is consensus in the communities that are most affected that a | ||||
| change is desirable enough to make transition costs acceptable, then | ||||
| the change is made. | ||||
| The difference and its implications are perhaps best illustrated by a | ||||
| disagreement when IDNA2008 was being approved. IDNA2003 had | ||||
| effectively prevented some characters, notably (measured by intensity | ||||
| of the protests) the Sharp S character (U+00DF) from being used in | ||||
| DNS labels by mapping them to other characters before conversion to | ||||
| ACE form. It had also prohibited some other code points, notably ZWJ | ||||
| (U+200D) and ZWNJ (U+200C), by discarding them. In both cases, there | ||||
| were strong voices from the relevant language communities, supported | ||||
| by the registry communities, that the characters were important | ||||
| enough that it was more desirable to undergo the short-term pain of a | ||||
| transition and some uncertainty than to continue to exclude those | ||||
| characters and the IDNA2008 rules and repertoire are consistent with | ||||
| that preference. The Unicode Consortium apparently believed that | ||||
| stability --elimination of any possibility of label invalidation or | ||||
| different interpretations of the same string-- was more important | ||||
| than those writing system requirements and community preferences. | ||||
| That view was expressed through what was effectively a fork in (or | ||||
| attempt to nullify) the IETF Standard [UTS46], a result that has | ||||
| probably been worse for the overall Internet than either of the | ||||
| possible decision choices. | ||||
| 4.2. New Unicode Properties | ||||
| One suggestion about the way out of these problems would be to create | ||||
| one or more new Unicode properties, maintained along with the rest of | ||||
| Unicode, and then incorporated into new or modified rules or | ||||
| categories in IDNA. Given the analysis in this document, it appears | ||||
| that that property (or properties) would need to provide: | ||||
| 1. Identification of combining characters that, when used in | ||||
| combining sequences, do not produce decomposable characters. | ||||
| [[CREF2: Wording on the above is not quite right but, for the | ||||
| present, maybe the intent is clear.]] | ||||
| 2. Identification of precomposed characters that might reasonably be | ||||
| expected to decompose, but that do not. | ||||
| 3. Identification of character forms that are distinct only because | ||||
| of language or phonetic distinctions within a script. | ||||
| 4. Identification of scripts for which precomposed forms are | ||||
| strongly preferred and combining sequences should either be | ||||
| viewed as temporary mechanisms until precomposed characters are | ||||
| assigned or banned entirely. | ||||
| 5. Identification of code points that represent symbols for | ||||
| specific, non-language, purposes even if identified as letters or | ||||
| numerals by their General Category (see Section 3.3.2.2 and | ||||
| Section 3.3.2.1). | ||||
| Some of these properties (or characteristics or values of a single | ||||
| property) would be suitable for disallowing characters, code points, | ||||
| or contextual sequences that otherwise might be allowed by IDNA. | ||||
| Others would be more suitable for making equality comparisons come | ||||
| out as needed by IDNA, particularly to eliminate distinctions based | ||||
| on language context. | ||||
| While it would appear that appropriate rules and categories could be | ||||
| developed for IDNA (and, presumably, for PRECIS, etc.) if the problem | ||||
| areas are those identified in this document, it is not yet known | ||||
| whether the list is complete (and, hence, whether additional | ||||
| properties or information would be needed). | ||||
| Even with such properties, IDNA would still almost certainly need | ||||
| exception lists. In addition, it is likely that stability rules for | ||||
| those properties would need to reflect IETF norms with arrangements | ||||
| for bringing the IETF and other communities into the discussion when | ||||
| tradeoffs are reviewed. | ||||
| 4.3. The need for exception lists | ||||
| [[CREF3: Note in draft: this section is a partial placeholder and may | ||||
| need more elaboration.]] | ||||
| Issues with exception lists and the requirements for them are | ||||
| discussed in Section 2 above and RFC 5894 [RFC5894]. | ||||
| 5. Proposed/Alternative Changes to RFC 5892 for the issues first | ||||
| exposed by new code point U+08A1 | ||||
| NOTE IN DRAFT: See the comments in the Introduction, Section 1 and | NOTE IN DRAFT: See the comments in the Introduction, Section 1 and | |||
| the first paragraph of each Subsection below for the status of the | the first paragraph of each Subsection below for the status of the | |||
| Subsections that follow. Each one, in combination with the material | Subsections that follow. Each one, in combination with the material | |||
| in Section 2 above, also provides information about the reasons why | in Section 3 above, also provides information about the reasons why | |||
| that particular strategy is appropriate. | that particular strategy might or might not be appropriate. | |||
| 3.1. Disallow This New Code Point | 5.1. Disallow This New Code Point | |||
| This option is almost certainly too Arabic-specific and does not | ||||
| solve, or even address, the underlying problem. It also does not | ||||
| inherently generalize to non-decomposing precomposed code points that | ||||
| might be added in the future (whether to Arabic or other scripts) | ||||
| even though one could add more code points to Category F in the same | ||||
| way. | ||||
| If chosen by the community, this subsection would update the portion | If chosen by the community, this subsection would update the portion | |||
| of the IDNA2008 specification that identifies rules for what | of the IDNA2008 specification that identifies rules for what | |||
| characters are permitted [RFC5892] to disallow that code point. | characters are permitted [RFC5892] to disallow that code point. | |||
| With the publication of this document, Section 2.6 ("Exceptions (F)") | With the publication of this document, Section 2.6 ("Exceptions (F)") | |||
| of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in | of RFC 5892 [RFC5892] is updated by adding 08A1 to the rule in | |||
| Category F so that the rule itself reads: | Category F so that the rule itself reads: | |||
| F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, | F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660, | |||
| skipping to change at page 10, line 13 ¶ | skipping to change at page 19, line 28 ¶ | |||
| Section 5.3). However, that category is described as applying only | Section 5.3). However, that category is described as applying only | |||
| when "property values in versions of Unicode after 5.2 have changed | when "property values in versions of Unicode after 5.2 have changed | |||
| in such a way that the derived property value would no longer be | in such a way that the derived property value would no longer be | |||
| PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in | PVALID or DISALLOWED". Because U+08A1 is a newly-added code point in | |||
| Unicode 7.0.0 and no property values of code points in prior versions | Unicode 7.0.0 and no property values of code points in prior versions | |||
| have changed, category G does not apply. If that section of RFC 5892 | have changed, category G does not apply. If that section of RFC 5892 | |||
| were to be replaced in the future, perhaps consideration should be | were to be replaced in the future, perhaps consideration should be | |||
| given to adding Normalization Stability and other issues to that | given to adding Normalization Stability and other issues to that | |||
| description but, at present, it is not relevant. | description but, at present, it is not relevant. | |||
| 3.2. Disallow the combining sequences for these characters | 5.2. Disallow This New Code Point and All Future Precomposed Additions | |||
| that do not decompose | ||||
| At least in principle, the approach suggested above (Section 5.1) | ||||
| could be expanded to disallow all future allocations of non- | ||||
| decomposing precomposed characters. This would probably require | ||||
| either a new Unicode property to identify such characters and/or more | ||||
| emphasis on the manual, individual code point, checking of the new | ||||
| Unicode version review process (i.e., not just application of the | ||||
| existing rules and algorithm). It might require either a new rule in | ||||
| IDNA or a modification to the structure of Category F to make | ||||
| additions less tedious. It would do nothing for different ways to | ||||
| form identical characters within the same script that were not | ||||
| associated with decomposition and so would have to be used in | ||||
| conjunction with other approaches. Finally, for scripts (such as | ||||
| Arabic) where there is a very strong preference to avoid combining | ||||
| sequences, this approach would exclude exactly the wrong set of | ||||
| characters. | ||||
| 5.3. Disallow the combining sequences for these characters | ||||
| As in the approach discussed in Section 5.1, this approach is too | ||||
| Arabic-specific to address the more general problem. However, it | ||||
| illustrates a single-script approach and a possible mechanism for | ||||
| excluding combining sequences whose handling is connected to language | ||||
| information (information that, as discussed above, is not relevant to | ||||
| the DNS). | ||||
| If chosen by the community, this subsection would update the portion | If chosen by the community, this subsection would update the portion | |||
| of the IDNA2008 specification that identifies contextual rules | of the IDNA2008 specification that identifies contextual rules | |||
| [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction | [RFC5892] to prohibit (combining) Hamza Above (U+0654) in conjunction | |||
| with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that | with Arabic BEH (U+0628), HAH (U+062D), and REH (U+0631). Note that | |||
| the choice of this option is consistent with the general preference | the choice of this option is consistent with the general preference | |||
| for precomposed characters discussed above but would ban some labels | for precomposed characters discussed above but would ban some labels | |||
| that are valid today and that might, in principle, be in use. | that are valid today and that might, in principle, be in use. | |||
| The required prohibition could be imposed by creating a new | The required prohibition could be imposed by creating a new | |||
| contextual rule in RFC 5892 to constrain combining sequences | contextual rule in RFC 5892 to constrain combining sequences | |||
| containing Hamza Above. | containing Hamza Above. | |||
| As the Unicode Standard points out at some length [Unicode62-Arabic], | As the Unicode Standard points out at some length [Unicode70-Arabic], | |||
| Hamza is a problematic abstract character and the "Hamza Above" | Hamza is a problematic abstract character and the "Hamza Above" | |||
| construction even more so. IDNA has historically associated | construction even more so. IDNA has historically associated | |||
| characters whose use is reasonable in some contexts but not others | characters whose use is reasonable in some contexts but not others | |||
| with the special derived property "CONTEXTO" and then specified | with the special derived property "CONTEXTO" and then specified | |||
| specific, context-dependent, rules about where they may be used. | specific, context-dependent, rules about where they may be used. | |||
| Because Hamza Above is problematic (and spawns edge cases, as | Because Hamza Above is problematic (and spawns edge cases, as | |||
| discussed in the Unicode Standard section cited above), it was | discussed in the Unicode Standard section cited above), it was | |||
| suggested that a contextual rule might be appropriate. There are at | suggested that a contextual rule might be appropriate. There are at | |||
| least two reasons why a contextual rule would not be suitable for the | least two reasons why a contextual rule would not be suitable for the | |||
| present situation. | present situation. | |||
| skipping to change at page 11, line 12 ¶ | skipping to change at page 21, line 5 ¶ | |||
| characters within that script. Neither of these cases applies to | characters within that script. Neither of these cases applies to | |||
| the newly-added character even if one could imagine rules for the | the newly-added character even if one could imagine rules for the | |||
| use of Hamza Above (U+0654) that would reflect the considerations | use of Hamza Above (U+0654) that would reflect the considerations | |||
| of Chapter 8 of Unicode 6.2. Even had the latter been desired, | of Chapter 8 of Unicode 6.2. Even had the latter been desired, | |||
| it would be somewhat late now -- Hamza Above has been present as | it would be somewhat late now -- Hamza Above has been present as | |||
| a combining character (U+0654) in many versions of Unicode. | a combining character (U+0654) in many versions of Unicode. | |||
| While that section of the Unicode Standard describes the issues, | While that section of the Unicode Standard describes the issues, | |||
| it does not provide actionable guidance about what to do about it | it does not provide actionable guidance about what to do about it | |||
| for cases going forward or when visual identity is important. | for cases going forward or when visual identity is important. | |||
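
The prohibition discussed in this subsection can be illustrated with a short sketch. It is not a proposed contextual rule. The function and constant names are hypothetical, and the check assumes that Hamza Above immediately follows the base letter, ignoring any intervening combining marks.

```python
# Illustrative sketch only: flag labels in which combining Hamza Above
# (U+0654) directly follows BEH, HAH, or REH. Names are hypothetical.
HAMZA_ABOVE = "\u0654"
RESTRICTED_BASES = {
    "\u0628",  # ARABIC LETTER BEH
    "\u062D",  # ARABIC LETTER HAH
    "\u0631",  # ARABIC LETTER REH
}


def violates_hamza_above_rule(label: str) -> bool:
    """Return True if Hamza Above directly follows a restricted base letter."""
    for previous, current in zip(label, label[1:]):
        if current == HAMZA_ABOVE and previous in RESTRICTED_BASES:
            return True
    return False
```

Under this sketch, a label containing U+0628 U+0654 would be rejected, while one containing the precomposed U+08A1 would not.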
| 3.3. Do Nothing Other Than Warn | 5.4. Disallow all Combining Characters for Specific Scripts | |||
| [[CREF4: This subsection needs to be turned into prose, but the | ||||
| following bullet points are probably sufficient to identify the | ||||
| issues.]] | ||||
| Might work for Arabic and other "precomposed preference" scripts (see | ||||
| Section 3.3.4; recommended by the Arabic language community for IDNs | ||||
| [RFC5564]). Hopeless for Latin. Backwards incompatible. No effect | ||||
| at all on special-use representations of identical characters within | ||||
| a script (see Section 3.3.2.1 and Section 3.3.2.2). | ||||
| 5.5. Do Nothing Other Than Warn | ||||
| The recommendation from UTC is to simply warn registries, at all | The recommendation from UTC is to simply warn registries, at all | |||
| levels of the tree, to be careful with this set of characters, making | levels of the tree, to be careful with this set of characters, making | |||
| language distinctions within zones. Because the DNS cannot make or | language distinctions within zones. Because the DNS cannot make or | |||
| enforce language distinctions, this suggestion is problematic but it | enforce language distinctions, this suggestion is problematic but it | |||
| would avoid having the IETF either invalidating label strings that | would avoid having the IETF either invalidating label strings that | |||
| are potentially now in use or creating inconsistencies among the | are potentially now in use or creating inconsistencies among the | |||
| characters that combine with Hamza Above but that also have | characters that combine with Hamza Above but that also have | |||
| precomposed forms that do not have decompositions. The potential | precomposed forms that do not have decompositions. The potential | |||
| would still exist for registries to respect the warning and deprecate | would still exist for registries to respect the warning and deprecate | |||
| such labels if they existed. | such labels if they existed. | |||
| 3.4. Normalization Form IETF (or DNS) | 5.6. Normalization Form IETF (NFI) | |||
| The most radical possibility would be to decide that none of the | The most radical possibility for the comparison issue would be to | |||
| Unicode Normalization Forms specified in UAX 15 [UAX15] are adequate | decide that none of the Unicode Normalization Forms specified in UAX | |||
| for use with the DNS because, contrary to their apparent | 15 [UAX15] are adequate for use with the DNS because, contrary to | |||
| descriptions, normalization tables are actually determined using | their apparent descriptions, normalization tables are actually | |||
| language information. However, use of language information is | determined using language information. However, use of language | |||
| unacceptable for IDNA for reasons described elsewhere in this | information is unacceptable for IDNA for reasons described elsewhere | |||
| document. The remedy would be to define an IETF-specific (or DNS- | in this document. The remedy would be to define an IETF-specific (or | |||
| specific) normalization form, building on NFC but adhering strictly | DNS-specific) normalization form (sometimes called "NFI" in | |||
| to the rule that normalization causes two different forms of the same | discussions), building on NFC but adhering strictly to the rule that | |||
| character (glyph image) within the same script to be treated as | normalization causes two different forms of the same character (glyph | |||
| equal. In practice such a form would be implemented for IDNA | image) within the same script to be treated as equal. In practice | |||
| purposes as an additional rule within RFC 5892 (and its successors) | such a form could be implemented for IDNA purposes as an additional | |||
| that constituted an exception list for the NFC tables. For this set | rule within RFC 5892 (and its successors) that constituted an | |||
| of characters, the special IETF normalization form would be | exception list for the NFC tables. For this set of characters, the | |||
| equivalent to the exclusion discussed in Section 3.2 above. | special IETF normalization form would be equivalent to the exclusion | |||
| discussed in Section 5.3 above. | ||||
| 4. Editorial clarification to RFC 5892 | An Internet-specific normalization form, especially if specified | |||
| somewhat separately from the IDNA core, would have a small marginal | ||||
| advantage over the other strategies in this section (or in | ||||
| combination with some of them), even though most of the end result | ||||
| and much of the implementation would be the same in practice. While | ||||
| the design of IDNA requires that strings be normalized as part of the | ||||
| process of determining label validity (and hence before either | ||||
| storage of values in the DNS or name resolution), there is an ongoing | ||||
| debate about whether normalization should be performed before storing | ||||
| a string or putting it on the wire or only when the string is | ||||
| actually compared or otherwise used. | ||||
| If a normalization procedure with the right properties for the IETF | ||||
| were defined, that argument could be bypassed and the best decisions | ||||
| made for different circumstances. The separation would also allow | ||||
| better comparison of strings that lack language context in | ||||
| applications environments in which the additional processing and | ||||
| character classifications of IDNA and/or PRECIS were not applicable. | ||||
| Having such a normalization procedure defined outside IDNA would also | ||||
| minimize changes to IDNA itself, which is probably an advantage. | ||||
| If the new normalization form were, in practice, simply an overlay on | ||||
| NFC with modifications dictated by exception and/or property lists, | ||||
| keeping its definition separate from IDNA would also avoid | ||||
| interweaving those exceptions and property lists with the rules and | ||||
| categories of IDNA itself, avoiding some unnecessary complexity. | ||||
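
A minimal sketch of the "overlay on NFC" idea follows. It assumes an exception list with a single entry that maps the combining sequence U+0628 U+0654 to the precomposed U+08A1. The table contents, the direction of the mapping, and the function name are illustrative assumptions, not part of any specification.

```python
import unicodedata

# Hypothetical exception list layered on top of NFC. Each key is a sequence
# that NFC leaves unchanged; each value is the form it is to compare equal to.
NFI_EXCEPTIONS = {
    "\u0628\u0654": "\u08A1",  # BEH + HAMZA ABOVE -> BEH WITH HAMZA ABOVE
}


def nfi(text: str) -> str:
    """Apply NFC, then the exception overlay (illustrative only)."""
    normalized = unicodedata.normalize("NFC", text)
    for sequence, replacement in NFI_EXCEPTIONS.items():
        normalized = normalized.replace(sequence, replacement)
    return normalized
```

With this overlay, the two ways of forming the character compare equal after nfi() even though they remain distinct under NFC alone. A real exception list would presumably be longer and would need its own stability rules.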
| 6. Editorial clarification to RFC 5892 | ||||
| Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a | Verified RFC Editor Erratum 3312 [RFC5892Erratum] provides a | |||
| clarification to Appendix A and Section A.1 of RFC 5892. This | clarification to Appendix A and Section A.1 of RFC 5892. This | |||
| section of this document updates the RFC to apply that clarification. | section of this document updates the RFC to apply that clarification. | |||
| 1. In Appendix A, add a new paragraph after the paragraph that | 1. In Appendix A, add a new paragraph after the paragraph that | |||
| begins "The code point...". The new paragraph should read: | begins "The code point...". The new paragraph should read: | |||
| "For the rule to be evaluated to True for the label, it MUST be | "For the rule to be evaluated to True for the label, it MUST be | |||
| evaluated separately for every occurrence of the Code point in | evaluated separately for every occurrence of the Code point in | |||
| skipping to change at page 12, line 18 ¶ | skipping to change at page 23, line 5 ¶ | |||
| 2. In Appendix A, Section A.1, replace the "Rule Set" by | 2. In Appendix A, Section A.1, replace the "Rule Set" by | |||
| Rule Set: | Rule Set: | |||
| False; | False; | |||
| If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; | If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True; | |||
| If cp .eq. \u200C And | If cp .eq. \u200C And | |||
| RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp | RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*cp | |||
| (Joining_Type:T)*(Joining_Type:{R,D})) Then True; | (Joining_Type:T)*(Joining_Type:{R,D})) Then True; | |||
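
The rule set above, together with the requirement that it be evaluated for every occurrence of the code point, might be sketched as follows. The function names are hypothetical, and the JOINING_TYPE table is a tiny illustrative subset; a real implementation would take Joining_Type values from the Unicode Character Database.

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value for Virama

# Illustrative subset only; not a complete Joining_Type table.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u0654": "T",  # ARABIC HAMZA ABOVE (transparent)
}


def zwnj_rule_at(label: str, i: int) -> bool:
    """Evaluate the rule set for the ZWNJ at index i of label."""
    if label[i] != ZWNJ:
        raise ValueError("index does not point at U+200C")
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # (Joining_Type:{L,D})(Joining_Type:T)* immediately before cp
    j = i - 1
    while j >= 0 and JOINING_TYPE.get(label[j]) == "T":
        j -= 1
    if j < 0 or JOINING_TYPE.get(label[j]) not in ("L", "D"):
        return False
    # (Joining_Type:T)*(Joining_Type:{R,D}) immediately after cp
    k = i + 1
    while k < len(label) and JOINING_TYPE.get(label[k]) == "T":
        k += 1
    return k < len(label) and JOINING_TYPE.get(label[k]) in ("R", "D")


def zwnj_rule_for_label(label: str) -> bool:
    """True only if the rule holds separately for every ZWNJ in label."""
    return all(zwnj_rule_at(label, i)
               for i, ch in enumerate(label) if ch == ZWNJ)
```

A label with no ZWNJ passes this check trivially, since there is no occurrence for which the rule must be evaluated.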
| 5. Acknowledgements | 7. Acknowledgements | |||
| The Unicode 7.0.0 changes were extensively discussed within the IAB's | The Unicode 7.0.0 changes were extensively discussed within the IAB's | |||
| Internationalization Program. The authors are grateful for the | Internationalization Program. The authors are grateful for the | |||
| discussions and feedback there, especially from Andrew Sullivan and | discussions and feedback there, especially from Andrew Sullivan and | |||
| David Thaler. Additional information was requested and received from | David Thaler. Additional information was requested and received from | |||
| Mark Davis and Ken Whistler and while they probably do not agree with | Mark Davis and Ken Whistler and while they probably do not agree with | |||
| the necessity of excluding this code point or taking even more | the necessity of excluding this code point or taking even more | |||
| drastic action as their responsibility is to look at the Unicode | drastic action as their responsibility is to look at the Unicode | |||
| Consortium requirements for stability, the decision would not have | Consortium requirements for stability, the decision would not have | |||
| been possible without their input. Thanks to Bill McQuillan and Ted | been possible without their input. Thanks to Bill McQuillan and Ted | |||
| Hardie for reading versions of the document carefully enough to | Hardie for reading versions of the document carefully enough to | |||
| identify and report some confusing typographical errors. Several | identify and report some confusing typographical errors. Several | |||
| experts and reviewers who prefer to remain anonymous also provided | experts and reviewers who prefer to remain anonymous also provided | |||
| helpful input and comments on preliminary versions of this document. | helpful input and comments on preliminary versions of this document. | |||
| 6. IANA Considerations | 8. IANA Considerations | |||
| When the IANA registry and tables are updated to reflect Unicode | When the IANA registry and tables are updated to reflect Unicode | |||
| 7.0.0, changes should be made according to the decisions the IETF | 7.0.0, changes should be made according to the decisions the IETF | |||
| makes about Section 3. | makes about Section 5. | |||
| 7. Security Considerations | 9. Security Considerations | |||
| [[CREF1: NOTE IN DRAFT: This section is unchanged in version -01 of | From at least one point of view, this document is entirely a | |||
| this document relative to what appeared in -00. It will need to be | discussion of a security issue or set of such issues. While the | |||
| rewritten once decisions are made about what path to follow. In | "similar-looking characters" issue has been a concern since the | |||
| particular, if "just warn" is chosen, it will need to contain very | earliest days of IDNs [HomographAttack] and has driven assorted | |||
| strong warnings.]] | "character confusion" projects [ICANN-VIP], the problem discussed | |||
| here goes further. If a user types in a string on one device and | ||||
| can get different results that do not compare equal when it is | ||||
| typed on a different device (with both behaving correctly and both | ||||
| keyboards appearing to be the same and for the same script), then | ||||
| all security mechanisms that depend on the underlying identifiers, | ||||
| including the practical applications of DNS response integrity | ||||
| checks (DNSSEC [RFC4033]) and DNS-embedded public key mechanisms | ||||
| [RFC6698], are at risk if different parties, at least one of them | ||||
| malicious, obtain some of the identical-appearing and | ||||
| identically-typed strings. | ||||
| Mechanisms that depend on trusting registration systems (e.g., | ||||
| registries and registrars in the DNS IDN case, see Section 5.5 above) | ||||
| are likely to be of only limited utility because fully-qualified | ||||
| domains that may be perfectly reasonable at the first level or two of | ||||
| the DNS may have differences of this type deep in the tree, into | ||||
| levels where name management is weak. Similar issues obviously apply | ||||
| when names are user-selected or unmanaged. | ||||
| When the issue is not a deliberate attack but simple accidental | ||||
| confusion among similar strings, most of our strategies depend on the | ||||
| acceptability of false negatives on matching if there is low risk of | ||||
| false positives (see, for example, the discussion of false negatives | ||||
| in identifier comparison in Section 2.1 of RFC 6943 [RFC6943]). | ||||
| Aspects of that issue appear in, for example, RFC 3986 [RFC3986] and | ||||
| the PRECIS effort [PRECIS-Framework]. But, because the cases covered | ||||
| here are connected, not just to what the user sees but to what is | ||||
| typed and where, there is an increased risk of false positives | ||||
| (accidental as well as deliberate). | ||||
| [[CREF5: Note in Draft: The paragraph that follows was written for a | ||||
| much earlier version of this document. It is obsolete, but is being | ||||
| retained as a placeholder for future developments.]] | ||||
| This specification excludes a code point for which the Unicode- | This specification excludes a code point for which the Unicode- | |||
| specified normalization behavior could result in two ways to form a | specified normalization behavior could result in two ways to form a | |||
| visually-identical character within the same script not comparing | visually-identical character within the same script not comparing | |||
| equal. That behavior could create a dream case for someone intending | equal. That behavior could create a dream case for someone intending | |||
| to confuse the user by use of a domain name that looked identical to | to confuse the user by use of a domain name that looked identical to | |||
| another one, was entirely in the same script, but was still | another one, was entirely in the same script, but was still | |||
| considered different (see, for example, the discussion of false | considered different. | |||
| negatives in identifier comparison in Section 2.1 of RFC 6943 | ||||
| [RFC6943]). This exclusion therefore should improve Internet | ||||
| security. | ||||
| 8. References | Internet Security in areas that involve internationalized identifiers | |||
| that might contain the relevant characters is therefore significantly | ||||
| dependent on some effective resolution for the issues identified in | ||||
| this document, not just hand waving, devout wishes, or appointment of | ||||
| study committees about it. | ||||
| 8.1. Normative References | 10. References | |||
| 10.1. Normative References | ||||
| [PRECIS-Framework] | ||||
| Saint-Andre, P. and M. Blanchet, "PRECIS Framework: | ||||
| Preparation, Enforcement, and Comparison of | ||||
| Internationalized Strings in Application Protocols", | ||||
| February 2015, <https://datatracker.ietf.org/doc/draft- | ||||
| ietf-precis-framework/>. | ||||
| [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP | [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", BCP | |||
| 137, RFC 5137, February 2008. | 137, RFC 5137, February 2008. | |||
| [RFC5890] Klensin, J., "Internationalized Domain Names for | [RFC5890] Klensin, J., "Internationalized Domain Names for | |||
| Applications (IDNA): Definitions and Document Framework", | Applications (IDNA): Definitions and Document Framework", | |||
| RFC 5890, August 2010. | RFC 5890, August 2010. | |||
| [RFC5892] Faltstrom, P., "The Unicode Code Points and | [RFC5892] Faltstrom, P., "The Unicode Code Points and | |||
| Internationalized Domain Names for Applications (IDNA)", | Internationalized Domain Names for Applications (IDNA)", | |||
| skipping to change at page 14, line 5 ¶ | skipping to change at page 25, line 35 ¶ | |||
| [UAX15-Exclusion] | [UAX15-Exclusion] | |||
| "Unicode Standard Annex #15: ob. cit., Section 5", | "Unicode Standard Annex #15: ob. cit., Section 5", | |||
| <http://www.unicode.org/reports/ | <http://www.unicode.org/reports/ | |||
| tr15/#Primary_Exclusion_List_Table>. | tr15/#Primary_Exclusion_List_Table>. | |||
| [UAX15-Versioning] | [UAX15-Versioning] | |||
| "Unicode Standard Annex #15, ob. cit., Section 3", | "Unicode Standard Annex #15, ob. cit., Section 3", | |||
| <http://www.unicode.org/reports/tr15/#Versioning>. | <http://www.unicode.org/reports/tr15/#Versioning>. | |||
| [UTS46] Davis, M. and M. Suignard, "Unicode Technical Standard | ||||
| #46: Unicode IDNA Compatibility Processing", Version | ||||
| 7.0.0, June 2014, <http://unicode.org/reports/tr46/>. | ||||
| [Unicod70-CompatDecomp] | ||||
| "The Unicode Standard, Version 7.0.0, ob.cit., Chapter | ||||
| 2.3: Compatibility Characters", Chapter 2, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>. | ||||
| Subsection titled "Compatibility Decomposable Characters" | ||||
| starting on page 26. | ||||
| [Unicod70-Overlay] | ||||
| "The Unicode Standard, Version 7.0.0, ob.cit., Chapter | ||||
| 2.2: Unicode Design Principles", Chapter 2, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>. | ||||
| Subsection titled "Non-decomposition of Overlaid | ||||
| Diacritics" starting on page 64. | ||||
| [Unicode5] | [Unicode5] | |||
| The Unicode Consortium, "The Unicode Standard, Version | The Unicode Consortium, "The Unicode Standard, Version | |||
| 5.0", ISBN 0-321-48091-0, 2007. | 5.0", ISBN 0-321-48091-0, 2007. | |||
| Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. | Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0. | |||
| This printed reference has now been updated online to | This printed reference has now been updated online to | |||
| reflect additional code points. For code points, the | reflect additional code points. For code points, the | |||
| reference at the time RFC 5890-5894 were published is to | reference at the time RFC 5890-5894 were published is to | |||
| Unicode 5.2. | Unicode 5.2. | |||
| [Unicode62] | [Unicode62] | |||
| The Unicode Consortium, "The Unicode Standard, Version | The Unicode Consortium, "The Unicode Standard, Version | |||
| 6.2.0", ISBN 978-1-936213-07-8, 2012, | 6.2.0", ISBN 978-1-936213-07-8, 2012, | |||
| <http://www.unicode.org/versions/Unicode6.2.0/>. | <http://www.unicode.org/versions/Unicode6.2.0/>. | |||
| Preferred citation: The Unicode Consortium. The Unicode | Preferred citation: The Unicode Consortium. The Unicode | |||
| Standard, Version 6.2.0, (Mountain View, CA: The Unicode | Standard, Version 6.2.0, (Mountain View, CA: The Unicode | |||
| Consortium, 2012. ISBN 978-1-936213-07-8) | Consortium, 2012. ISBN 978-1-936213-07-8) | |||
| [Unicode62-Arabic] | ||||
| "The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8", | ||||
| Chapter 8, 2012, | ||||
| <http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>. | ||||
| Subsection titled "Encoding Principles", paragraph | ||||
| numbered 4, starting on page 251. | ||||
| [Unicode62-Hamza] | ||||
| "The Unicode Standard, Version 6.2.0, ob.cit., Chapter 8", | ||||
| Chapter 8, 2012, | ||||
| <http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf>. | ||||
| Subsection titled "Combining Hamza Above" starting on page | ||||
| 263. | ||||
| [Unicode7] | [Unicode7] | |||
| The Unicode Consortium, "The Unicode Standard, Version | The Unicode Consortium, "The Unicode Standard, Version | |||
| 7.0.0", ISBN 978-1-936213-09-2, 2014, | 7.0.0", ISBN 978-1-936213-09-2, 2014, | |||
| <http://www.unicode.org/versions/Unicode7.0.0/>. | <http://www.unicode.org/versions/Unicode7.0.0/>. | |||
| Preferred Citation: The Unicode Consortium. The Unicode | Preferred Citation: The Unicode Consortium. The Unicode | |||
| Standard, Version 7.0.0, (Mountain View, CA: The Unicode | Standard, Version 7.0.0, (Mountain View, CA: The Unicode | |||
| Consortium, 2014. ISBN 978-1-936213-09-2) | Consortium, 2014. ISBN 978-1-936213-09-2) | |||
| 8.2. Informative References | [Unicode70-Arabic] | |||
| "The Unicode Standard, Version 7.0.0, ob.cit., Chapter | ||||
| 9.2: Arabic", Chapter 9, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>. | ||||
| Subsection titled "Encoding Principles", paragraph | ||||
| numbered 4, starting on page 362. | ||||
| [Unicode70-Design] | ||||
| "The Unicode Standard, Version 7.0.0, ob.cit., Chapter | ||||
| 2.2: Unicode Design Principles", Chapter 2, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>. | ||||
| [Unicode70-Hamza] | ||||
| "The Unicode Standard, Version 7.0.0, op. cit., Chapter | ||||
| 9.2: Arabic", Chapter 9, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf>. | ||||
| Subsection titled "Combining Hamza Above" starting on page | ||||
| 378. | ||||
| [Unicode70-Stability] | ||||
| "The Unicode Standard, Version 7.0.0, op. cit., Chapter | ||||
| 2.2: Unicode Design Principles", Chapter 2, 2014, | ||||
| <http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf>. | ||||
| Subsection titled "Stability" starting on page 23 and | ||||
| containing a link to http://www.unicode.org/policies/ | ||||
| stability_policy.html. | ||||
| 10.2. Informative References | ||||
| [Dalby] Dalby, A., "Dictionary of Languages: The definitive | ||||
| reference to more than 400 languages", Columbia University | ||||
| Press, 2004. | ||||
| pages 206-207 | ||||
| [Daniels] Daniels, P. and W. Bright, "The World's Writing Systems", | ||||
| Oxford University Press, 1986. | ||||
| [HomographAttack] | ||||
| Gabrilovich, E. and A. Gontmakher, "The Homograph Attack", | ||||
| Communications of the ACM 45(2):128, February 2002, | ||||
| <http://www.cs.technion.ac.il/~gabr/papers/ | ||||
| homograph_full.pdf>. | ||||
| [ICANN-VIP] | ||||
| ICANN, "The IDN Variant Issues Project: A Study of Issues | ||||
| Related to the Management of IDN Variant TLDs (Integrated | ||||
| Issues Report)", February 2012, | ||||
| <https://www.icann.org/en/system/files/files/idn-vip- | ||||
| integrated-issues-final-clean-20feb12-en.pdf>. | ||||
| [Omniglot-Fula] | ||||
| Ager, S., "Omniglot: Fula (Fulfulde, Pulaar, | ||||
| Pular'Fulaare)", | ||||
| <http://www.omniglot.com/writing/fula.htm>. | ||||
| Captured 2015-01-07 | ||||
| [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| RFC 3490, March 2003. | RFC 3490, March 2003. | |||
| [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | ||||
| Resource Identifier (URI): Generic Syntax", STD 66, RFC | ||||
| 3986, January 2005. | ||||
| [RFC4033] Arends, R., Austein, R., Larson, M., Massey, D., and S. | ||||
| Rose, "DNS Security Introduction and Requirements", RFC | ||||
| 4033, March 2005. | ||||
| [RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman, | ||||
| "Linguistic Guidelines for the Use of the Arabic Language | ||||
| in Internet Domains", RFC 5564, February 2010. | ||||
| [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and | [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and | |||
| Internationalized Domain Names for Applications (IDNA) - | Internationalized Domain Names for Applications (IDNA) - | |||
| Unicode 6.0", RFC 6452, November 2011. | Unicode 6.0", RFC 6452, November 2011. | |||
| [RFC6698] Hoffman, P. and J. Schlyter, "The DNS-Based Authentication | ||||
| of Named Entities (DANE) Transport Layer Security (TLS) | ||||
| Protocol: TLSA", RFC 6698, August 2012. | ||||
| [Unicode32] | [Unicode32] | |||
| The Unicode Consortium, "The Unicode Standard, Version | The Unicode Consortium, "The Unicode Standard, Version | |||
| 3.2.0", . | 3.2.0", . | |||
| The Unicode Standard, Version 3.2.0 is defined by The | The Unicode Standard, Version 3.2.0 is defined by The | |||
| Unicode Standard, Version 3.0 (Reading, MA, Addison- | Unicode Standard, Version 3.0 (Reading, MA, Addison- | |||
| Wesley, 2000. ISBN 0-201-61633-5), as amended by the | Wesley, 2000. ISBN 0-201-61633-5), as amended by the | |||
| Unicode Standard Annex #27: Unicode 3.1 | Unicode Standard Annex #27: Unicode 3.1 | |||
| (http://www.unicode.org/reports/tr27/) and by the Unicode | (http://www.unicode.org/reports/tr27/) and by the Unicode | |||
| Standard Annex #28: Unicode 3.2 | Standard Annex #28: Unicode 3.2 | |||
| skipping to change at page 15, line 48 ¶ | skipping to change at page 29, line 10 ¶ | |||
| A.2. Changes from version -01 to -02 | A.2. Changes from version -01 to -02 | |||
| Corrected a typographical error in which Hamza Above was incorrectly | Corrected a typographical error in which Hamza Above was incorrectly | |||
| listed with the wrong code point. | listed with the wrong code point. | |||
| A.3. Changes from version -02 to -03 | A.3. Changes from version -02 to -03 | |||
| Corrected a typographical error in the Abstract in which RFC 5892 was | Corrected a typographical error in the Abstract in which RFC 5892 was | |||
| incorrectly shown as 5982. | incorrectly shown as 5982. | |||
| A.4. Changes from version -03 to -04 | ||||
| o Explicitly identified the applicability of U+08A1 with Fula and | ||||
| added references that discuss that language and how it is written. | ||||
| o Updated several Unicode 6.2 references to point to Unicode 7.0 | ||||
| since the latter is now available in stable form (it was done when | ||||
| work on this I-D started). | ||||
| o Extensively revised to discuss the non-Arabic cases, non- | ||||
| decomposing diacritics, other types of characters that don't | ||||
| compare equal after normalization, and the more general problem and | ||||
| approaches. | ||||
| Authors' Addresses | Authors' Addresses | |||
| John C Klensin | John C Klensin | |||
| 1770 Massachusetts Ave, Ste 322 | 1770 Massachusetts Ave, Ste 322 | |||
| Cambridge, MA 02140 | Cambridge, MA 02140 | |||
| USA | USA | |||
| Phone: +1 617 245 1457 | Phone: +1 617 245 1457 | |||
| Email: john-ietf@jck.com | Email: john-ietf@jck.com | |||
| Patrik Faltstrom | Patrik Faltstrom | |||
| Netnod | Netnod | |||
| End of changes. 55 change blocks. | ||||
| 183 lines changed or deleted | 827 lines changed or added | |||