IDNABIS                                                  P. Resnick, Ed.
Internet-Draft                                     Qualcomm Incorporated
Intended status: Standards Track                              P. Hoffman
Expires: January 4, 2010                                  VPN Consortium
                                                            July 3, 2009

                       Mapping Characters in IDNA
                     draft-ietf-idnabis-mappings-01

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on January 4, 2010.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

In the original version of the Internationalized Domain Names in Applications (IDNA) protocol, any Unicode code points taken from user input were mapped into a set of Unicode code points that "make sense", which were then encoded and passed to the domain name system (DNS). The current version of IDNA presumes that the input to the protocol comes from a set of "permitted" code points, which it then encodes and passes to the DNS, but does not specify what to do with the result of user input. This document describes the actions taken by an implementation between user input and passing permitted code points to the new IDNA protocol.

Table of Contents

1. Introduction
2. Architectural Principles
3. The General Procedure
4. IANA Considerations
5. Security Considerations
6. Normative References
Authors' Addresses

1. Introduction

This document describes the operations that can be applied to user input in order to get it into a form that is acceptable to the Internationalized Domain Names in Applications (IDNA) protocol [I-D.ietf-idnabis-protocol].
The document describes the underlying architectural principles (in section 2) and the general implementation procedure (in section 3).

It should be noted that this document does not specify the behavior of a protocol that appears "on the wire". It describes an operation that is to be applied to user input in order to prepare that user input for use in an "on the network" protocol. As unusual as this may be for an IETF protocol document, it is a necessary operation to maintain interoperability.

2. Architectural Principles

An application that implements the IDNA protocol [I-D.ietf-idnabis-protocol] will always take any user input and convert it to a set of Unicode code points. That user input may be acquired by any of several different input methods, all with differing conversion processes to be taken into consideration (e.g., typed on a keyboard, written by hand onto some sort of digitizer, spoken into a microphone and interpreted by a speech-to-text engine, etc.). The process of taking any particular user input and mapping it into a Unicode code point may be a simple one: If a user strikes the "A" key on a US English keyboard, without any modifiers such as the "Shift" key held down, in order to draw a Latin small letter A ("a"), many (perhaps most) modern operating system input methods will produce to the calling application the code point U+0061, encoded in a single octet.
Sometimes the process is somewhat more complicated: a user might strike a particular set of keys to represent a combining macron followed by striking the "A" key in order to draw a Latin small letter A with a macron above it. Depending on the operating system, the input method chosen by the user, and even the parameters with which the application communicates with the input method, the result might be the code point U+0101 (encoded as two octets in UTF-8 or UTF-16, four octets in UTF-32, etc.), the code point U+0061 followed by the code point U+0304 (again, encoded in three or more octets, depending upon the encoding used) or even the code point U+FF41 followed by the code point U+0304 (and encoded in some form). And these examples leave aside the issue of operating systems and input methods that do not use Unicode code points for their character set. In every case, applications (with the help of the operating systems on which they run and the input methods used) need to perform a mapping from user input into Unicode code points.

The original version of the IDNA protocol [RFC3490] used a model whereby input was taken from the user, mapped (via whatever input method mechanisms were used) to a set of Unicode code points, and then further mapped to a set of Unicode code points using the Nameprep profile specified in [RFC3491]. In this procedure, there are two separate mapping steps: First, a mapping done by the input method (which might be controlled by the operating system, the application, or some combination) and then a second mapping performed by the Nameprep portion of the IDNA protocol. The mapping done in Nameprep includes a particular mapping table to re-map some characters to other characters, a particular normalization, and a set of prohibited characters.

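As an illustration of the two-step model described above, the following Python sketch takes the three possible input-method results for "a with macron" and passes each through a Nameprep implementation. It assumes that the nameprep function in Python's standard encodings.idna module is an acceptable stand-in for the Nameprep profile of [RFC3491]; that module works from Unicode 3.2 data. The helper name describe and the labels in candidates are hypothetical and exist only for this example.

```python
from encodings.idna import nameprep  # stdlib Nameprep (assumed stand-in for RFC 3491)

# Three results an input method might produce for "a with macron".
candidates = {
    "precomposed": "\u0101",
    "base + combining macron": "a\u0304",
    "fullwidth base + combining macron": "\uff41\u0304",
}


def describe(text: str) -> str:
    """List the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for name, text in candidates.items():
    # First step: whatever the input method produced (text).
    # Second step: the Nameprep mapping of the original IDNA model.
    mapped = nameprep(text)
    octets = len(text.encode("utf-8"))
    print(f"{name}: {describe(text)} ({octets} octets in UTF-8) "
          f"-> {describe(mapped)}")
```

In this sketch all three inputs come out of the second step as the single code point U+0101, which is the regularizing effect of a fixed second mapping.
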
Note that the two-step mapping process means that the mapping chosen by the operating system or application in the first step might differ significantly from the mapping supplied by the Nameprep profile in the second step. This has advantages and disadvantages. Of course, the second mapping regularizes what gets looked up in the DNS, making for better interoperability between implementations that use the Nameprep mapping. However, the application or operating system may choose mappings in their input methods that, when passed through the second (Nameprep) mapping, result in characters that are "surprising" to the end user.

The other important feature of the original version of the IDNA protocol is that, with very few exceptions, it assumes that any set of Unicode code points provided to the Nameprep mapping can be mapped into a string of Unicode code points that are "sensible", even if that means mapping some code points to nothing (that is, removing the code points from the string). This allowed maximum flexibility in input strings.

The present version of IDNA differs significantly in approach from the original version. First and foremost, it does not provide explicit mapping instructions. Instead, it assumes that the application (perhaps via an operating system input method) will do whatever mapping it requires to convert input into Unicode code points. This has the advantage of giving flexibility to the application to choose a mapping that is suitable for its user given specific user requirements, and avoids the two-step mapping of the original protocol. Instead of a mapping, the current version of IDNA provides a set of categories that can be used to specify the valid code points allowed in a domain name.

In principle, an application ought to take user input of a domain name and convert it to the set of Unicode code points that represent the domain name the user intends.
As a practical matter, of course, determining user intent is a tricky business, so an application needs to choose a reasonable mapping from user input. That may differ based on the particular circumstances of a user, depending on locale, language, type of input method, etc. It is up to the application to make a reasonable choice.

3. The General Procedure

This section defines a general algorithm that applications ought to implement in order to produce Unicode code points that will be valid under the IDNA protocol. An application might implement the full mapping as described below, or can choose a different mapping. In fact, an application might want to implement a full mapping that is substantially compatible with the original IDNA protocol instead of the algorithm given here.

The general algorithm that an application (or the input method provided by an operating system) ought to use is relatively straightforward and generally follows section 5 of [I-D.ietf-idnabis-protocol]:

1. All characters are mapped using Unicode Normalization Form C (NFC).

2. Upper case characters are mapped to their lower case equivalents by using the algorithm for mapping Unicode characters.

3. Full-width and half-width characters (those defined with Decomposition Types <wide> and <narrow>) are mapped to their decomposition mappings as shown in the Unicode character database.

Definitions for the rules in this algorithm can be found in [Unicode51]. Specifically:

o Unicode Normalization Form C can be found in Annex #15 of [Unicode51].

o In order to map upper case characters to their lower case equivalents (defined in section 3.13 of [Unicode51]), first map characters to the "Lowercase_Mapping" property (the "<lower>" entry in the second column) in <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any. Then, map characters to the "Simple_Lowercase_Mapping" property (the fourteenth column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

o In order to map full-width and half-width characters to their decomposition mappings, map any character whose "Decomposition_Type" (contained in the first part of the sixth column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt> is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of that character (contained in the second part of the sixth column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.

o The <http://www.unicode.org/Public/UNIDATA/UCD.html> web page has useful descriptions of the contents of these files.

If the mappings in this document are applied to versions of Unicode later than Unicode 5.1, the later versions of the Unicode Standard should be consulted.

These are a minimal set of mappings that an application should strongly consider doing. Of course, there are many others that might be done.

4. IANA Considerations

This memo includes no request to IANA.

5. Security Considerations

This document suggests creating mappings that might cause confusion for some users while alleviating confusion in other users.
Such confusion is not covered in any depth in this document (nor in the other IDNA-related documents).

6. Normative References

[I-D.ietf-idnabis-protocol] Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", draft-ietf-idnabis-protocol-12 (work in progress), May 2009.

[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.

[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.

[Unicode51] The Unicode Consortium, "The Unicode Standard, Version 5.1.0", 2008. Defined by: The Unicode Standard, Version 5.0, Boston, MA, Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by Unicode 5.1.0 (<http://www.unicode.org/versions/Unicode5.1.0/>).

Authors' Addresses

Peter W. Resnick (editor)
Qualcomm Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
US

Phone: +1 858 651 4478
Email: presnick@qualcomm.com
URI: http://www.qualcomm.com/~presnick/

Paul Hoffman
VPN Consortium
127 Segre Place
Santa Cruz, CA 95060
US

Phone: 1-831-426-9827
Email: paul.hoffman@vpnc.org
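
Illustrative sketch of the general procedure

The three steps of the general procedure in section 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names map_width and map_for_idna are hypothetical, str.lower() is used as an approximation of the SpecialCasing.txt and UnicodeData.txt lower case rules, and Python's unicodedata module reflects whatever Unicode version ships with the interpreter rather than Unicode 5.1.

```python
import unicodedata

# Decomposition_Type tags that step 3 acts on.
WIDTH_TAGS = ("<wide>", "<narrow>")


def map_width(ch: str) -> str:
    """Step 3: map a full-width or half-width character to its
    Decomposition_Mapping; return any other character unchanged."""
    # unicodedata.decomposition() returns e.g. "<wide> 0061", or "" if none.
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0] in WIDTH_TAGS:
        return "".join(chr(int(cp, 16)) for cp in fields[1:])
    return ch


def map_for_idna(text: str) -> str:
    """Apply the three steps in the order they are listed in section 3."""
    text = unicodedata.normalize("NFC", text)      # step 1: NFC
    text = text.lower()                            # step 2: lower case (approximation)
    return "".join(map_width(ch) for ch in text)   # step 3: <wide>/<narrow>


if __name__ == "__main__":
    for sample in ("Example", "A\u0304", "\uff21\uff22\uff23"):
        mapped = map_for_idna(sample)
        print(" ".join(f"U+{ord(c):04X}" for c in mapped))
```

Because the width mapping runs last in this ordering, its output is not necessarily in NFC: the sequence U+FF41 U+0304 from section 2 comes out as U+0061 U+0304 rather than U+0101. A caller that needs NFC output might normalize once more afterwards; that extra step is an assumption of this note and is not part of the procedure as described.
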