Network Working Group                                         P. Resnick
Internet-Draft                                     Qualcomm Incorporated
Intended status: Informational                                P. Hoffman
Expires: October 21, 2010                                 VPN Consortium
                                                          April 19, 2010

                     Mapping Characters in IDNA2008
                   draft-resman-idna2008-mappings-01

Abstract

   In the original version of the Internationalized Domain Names in
   Applications (IDNA) protocol, any Unicode code points taken from user
   input were mapped into a set of Unicode code points that "made
   sense", and then encoded and passed to the domain name system (DNS).
   The IDNA2008 protocol presumes that the input to the protocol comes
   from a set of "permitted" code points, which it then encodes and
   passes to the DNS, but does not specify what to do with the result of
   user input. This document describes the actions that can be taken by
   an implementation between user input and passing permitted code
   points to the new IDNA protocol.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 21, 2010.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

1. Introduction

   This document describes the operations that can be applied to user
   input in order to get it into a form acceptable to the
   Internationalized Domain Names in Applications (IDNA) protocol
   [IDNA2008protocol]. It includes a general implementation procedure
   for mapping.

   Note that this document does not specify the behavior of a protocol
   that appears "on the wire".
   It describes an operation that is to be applied to user input in
   order to prepare that user input for use in an "on the network"
   protocol. As unusual as this may be for a document concerning
   Internet protocols, it is necessary to describe this operation for
   implementers who may have designed around the original IDNA
   protocol, which conflates this user input operation with the
   protocol.

   It is very important to note that there are many potential valid
   mappings of characters from user input. The mapping described in
   this document is the basis for other mappings, and is not likely to
   be useful without modification. Any useful mapping will have
   features designed to reduce the surprise for users and is likely to
   be slightly (or sometimes radically) different depending on the
   locale of the user, the type of input being used (such as typing,
   copy-and-paste, voice, and so on), the type of application used,
   etc. Although most common mappings will probably produce similar
   results for the same input, there will be subtle differences between
   applications.

1.1. The Dividing Line between User Interface and Protocol

   The user interface to applications is much more complicated than most
   network implementers think.
   When we say "the user enters an internationalized domain name in the
   application", we are talking about a very complex process that
   encompasses everything from the user formulating the name and
   deciding which symbols to use to express that name, to the user
   entering the symbols into the computer using some input method (be
   it a keyboard, a stylus, or even a voice recognition program), to
   the computer interpreting that input (be it keyboard scan codes, a
   graphical representation, or digitized sounds) into some
   representation of those symbols, and finally to normalizing those
   symbols into a particular character repertoire in an encoding
   recognizable to IDNA processes and the domain name system.

   Designing a user interface for internationalized domain names
   involves taking into account the culture, context, and locale of any
   given user. A simple and well-known example is the lowercasing of
   the Latin capital letter I (U+0049) when it is used in Turkish and
   other languages. A capital "I" in Turkish is properly lowercased to
   a lowercase dotless "i" (U+0131), not to a Latin small letter i
   (U+0069). This lowercasing is clearly dependent on the locale of the
   system and/or the locale of the user. Using a single context-free
   mapping without considering the user interface properties has the
   potential of doing exactly the wrong thing for the user.

   The original version of IDNA conflated user interface processing and
   protocol. It took whatever characters the user produced in whatever
   encoding the application used, assumed some conversion to Unicode
   code points, and then without regard to context, locale, or anything
   about the user's intentions, mapped them into a particular set of
   other characters, and then re-encoded them in Punycode, in order to
   have the entire operation be contained within the protocol.
   Ignoring context, locale, and user preference in the IDNA protocol
   made life significantly less complicated for the application
   developer, but at the expense of violating the principle of "least
   user surprise" for consumers and producers of domain names.

   In IDNA2008, the dividing line between "user interface" and
   "protocol" is clear. The IDNA2008 specification defines the protocol
   part of IDNA: it explicitly does not deal with the user interface.
   Mappings such as the one described in this document explicitly deal
   with the user interface and not the protocol. That is, a mapping is
   only to be applied before a string of characters is treated as a
   domain name (in the "user interface") and is never to be applied
   during domain name processing (in the "protocol").

1.2. The Design of this Mapping

   The user interface mapping in this document is a set of expansions to
   IDNA2008 that are meant to be sensible and friendly and mostly
   obvious to people throughout the world when using typical
   applications with domain names that are entered by hand. It is also
   designed to let applications be mostly backwards compatible with
   IDNA2003. By definition, it cannot meet all of those design goals
   for all people, and in fact is known to fail on some of those goals
   for quite large populations of people.

   A good mapping in the real world might use the "sensible and friendly
   and mostly obvious" design goal but come up with a different
   algorithm. Many algorithms will have results that are close to what
   is described here, but will differ in assumptions about the users'
   way of thinking or typing. Having said that, it is likely that some
   mappings will be significantly different. For example, a mapping
   might apply to a spoken user interface instead of a typed one.
   Another example is that a mapping might be different for users typing
   than for users using copy-and-paste from different applications. Yet
   another example is that a user interface that allows typed input that
   is transliterated from Latin characters could have very different
   mappings than one that applies to typing in other character sets;
   this would be typical in a Pinyin input method for Chinese
   characters.

2. The General Procedure

   This section defines a general algorithm that applications ought to
   implement in order to produce Unicode code points that will be valid
   under the IDNA protocol. An application might implement the full
   mapping as described below, or can choose a different mapping. This
   mapping is very general and was designed to be acceptable to the
   widest user community, but as stated above, it does not take into
   account any particular context, culture, or locale.

   The general algorithm that an application (or the input method
   provided by an operating system) ought to use is relatively
   straightforward:

   1. Upper case characters are mapped to their lower case equivalents
      by using the algorithm for mapping case in Unicode characters.
      This step was chosen because the output will then behave more
      like ASCII host names do.

   2. Full-width and half-width characters (those defined with
      Decomposition Types <wide> and <narrow>) are mapped to their
      decomposition mappings as shown in the Unicode character
      database. This step was chosen because many input mechanisms,
      particularly in Asia, do not allow users to easily enter
      characters in the form used by IDNA2008. Even if they do allow
      the correct character form, the user might not know which form
      they are entering.

   3. All characters are mapped using Unicode Normalization Form C
      (NFC). This step was chosen because it maps combinations of
      combining characters into canonical composed form.
      As with the full-width/half-width mapping, users are not
      generally aware of the particular form of characters that they
      are entering, and IDNA2008 requires that only the canonical
      composed forms from NFC are used.

   4. [IDNA2008protocol] is specified such that the protocol acts on
      the individual labels of the domain name. If an implementation
      of this mapping is also performing the step of separation of the
      parts of a domain name into labels by using the FULL STOP
      character (U+002E), the IDEOGRAPHIC FULL STOP (U+3002) character
      can be mapped to the FULL STOP before label separation occurs.
      There are other characters that are used as "full stops" that one
      could consider mapping as label separators, but their use as such
      has not been investigated thoroughly. This step was chosen
      because some input mechanisms do not allow the user to easily
      enter proper label separators. Only the IDEOGRAPHIC FULL STOP
      (U+3002) character is added in this mapping because the authors
      have not fully investigated the applicability of other characters
      and the environments where they should and should not be
      considered domain name label separators.

   Note that the steps above are ordered.

   Definitions for the rules in this algorithm can be found in
   [Unicode52]. Specifically:

   o  Unicode Normalization Form C can be found in Annex #15 of
      [Unicode52].

   o  In order to map upper case characters to their lower case
      equivalents (defined in section 3.13 of [Unicode52]), first map
      characters to the "Lowercase_Mapping" property (the "<lower>"
      entry in the second column) in SpecialCasing.txt, if any. Then,
      map characters to the "Simple_Lowercase_Mapping" property (the
      fourteenth column) in UnicodeData.txt, if any.
   o  In order to map full-width and half-width characters to their
      decomposition mappings, map any character whose
      "Decomposition_Type" (contained in the first part of the sixth
      column) in UnicodeData.txt is either "<wide>" or "<narrow>" to
      the "Decomposition_Mapping" of that character (contained in the
      second part of the sixth column) in UnicodeData.txt.

   o  The Unicode Character Database [TR44] has useful descriptions of
      the contents of these files.

   If the mappings in this document are applied to versions of Unicode
   later than Unicode 5.2, the later versions of the Unicode Standard
   should be consulted.

   These form a minimal set of mappings that an application should
   strongly consider doing. Of course, there are many others that might
   be done.

3. Implementing This Mapping

   If you are implementing a mapping for an application or operating
   system by using exactly the four steps in Section 2, the authors of
   this document have a request: please don't. We mean it. Section 2
   does not describe a universal mapping algorithm because, as we said,
   there is no universally-applicable mapping algorithm.

   If you read the material in Section 2 without reading Section 1, go
   back and carefully read all of Section 1; in many ways, Section 1 is
   more important than Section 2. Further, you can probably think of
   user interface considerations that we did not list in Section 1. If
   you did read Section 1 but somehow decided that the algorithm in
   Section 2 is completely correct for the intended users of your
   application or operating system, you are probably not thinking hard
   enough about your intended users.

4. IANA Considerations

   This memo includes no request to IANA.

5. Security Considerations

   This document suggests creating mappings that might cause confusion
   for some users while alleviating confusion for other users.
   Such confusion is not covered in any depth in this document (nor in
   the other IDNA-related documents).

6. Acknowledgements

   This document is the product of many contributions from numerous
   people in the IETF.

7. Normative References

   [IDNA2008protocol]
              Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol",
              draft-ietf-idnabis-protocol (work in progress),
              January 2010.

   [TR44]     The Unicode Consortium, "Unicode Character Database",
              Unicode Standard Annex 44, 2009.

   [Unicode52]
              The Unicode Consortium, "The Unicode Standard, Version
              5.2.0", 2009.

              Defined by: The Unicode Standard, Version 5.0, Boston, MA,
              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
              Unicode 5.2.0.

Authors' Addresses

   Peter W. Resnick
   Qualcomm Incorporated
   5775 Morehouse Drive
   San Diego, CA 92121-1714
   US

   Phone: +1 858 651 4478
   Email: presnick@qualcomm.com
   URI:   http://www.qualcomm.com/~presnick/

   Paul Hoffman
   VPN Consortium
   127 Segre Place
   Santa Cruz, CA 95060
   US

   Phone: 1-831-426-9827
   Email: paul.hoffman@vpnc.org
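
   As an illustration only, the four ordered steps of Section 2 might
   be sketched in Python roughly as follows. This sketch makes several
   assumptions that do not come from this document: it relies on
   Python's standard unicodedata module, whose tables follow whatever
   Unicode version ships with the interpreter rather than Unicode 5.2;
   it uses str.lower() as a stand-in for the Lowercase_Mapping and
   Simple_Lowercase_Mapping lookups; and the names fold_width and
   map_user_input are hypothetical. In keeping with Section 3, it is
   not meant as a mapping that any real application should adopt
   unchanged.

```python
import unicodedata

# Decomposition types that mark full-width and half-width characters.
WIDTH_TAGS = ("<wide>", "<narrow>")


def fold_width(ch):
    """Step 2 helper: replace a wide/narrow character by its decomposition."""
    decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0041", or ""
    if not decomp:
        return ch
    tag, *code_points = decomp.split()
    if tag not in WIDTH_TAGS:
        # Canonical or other compatibility decompositions are left alone.
        return ch
    return "".join(chr(int(cp, 16)) for cp in code_points)


def map_user_input(text):
    """Apply the four generic steps, in order, to a string of user input."""
    # Step 1: lowercase (assumption: str.lower() approximates the
    # SpecialCasing.txt / UnicodeData.txt lookups described in Section 2).
    text = text.lower()
    # Step 2: map full-width and half-width characters.
    text = "".join(fold_width(ch) for ch in text)
    # Step 3: Unicode Normalization Form C.
    text = unicodedata.normalize("NFC", text)
    # Step 4: map IDEOGRAPHIC FULL STOP to FULL STOP before label splitting.
    return text.replace("\u3002", "\u002e")


if __name__ == "__main__":
    mapped = map_user_input("\uff25\uff38\u3002\uff23\uff2f\uff2d")
    print(mapped.split("."))  # labels that would then be handed to IDNA2008
```

   Because the steps are ordered, a half-width ideographic full stop
   (U+FF61) is first mapped to U+3002 in step 2 and only then to
   U+002E in step 4. Locale-sensitive cases such as the Turkish
   dotless i from Section 1.1 are deliberately not handled here.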