2	Network Working Group                                         P. Resnick
3	Internet-Draft                                     Qualcomm Incorporated
4	Intended status: Informational                                P. Hoffman
5	Expires: October 21, 2010                                 VPN Consortium
6	                                                          April 19, 2010

8	                     Mapping Characters in IDNA2008
9	                   draft-resman-idna2008-mappings-01

11	Abstract

13	   In the original version of the Internationalized Domain Names in
14	   Applications (IDNA) protocol, any Unicode code points taken from user
15	   input were mapped into a set of Unicode code points that "made
16	   sense", and then encoded and passed to the domain name system (DNS).
17	   The IDNA2008 protocol presumes that the input to the protocol comes
18	   from a set of "permitted" code points, which it then encodes and
19	   passes to the DNS, but does not specify what to do with the result of
20	   user input.  This document describes the actions that can be taken by
21	   an implementation between user input and passing permitted code
22	   points to the new IDNA protocol.

24	Status of this Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at http://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on October 21, 2010.

41	Copyright Notice

43	   Copyright (c) 2010 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (http://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.

53	1.  Introduction

55	   This document describes the operations that can be applied to user
56	   input in order to get it into a form acceptable to the
57	   Internationalized Domain Names in Applications (IDNA) protocol
58	   [IDNA2008protocol].  It includes a general implementation procedure
59	   for mapping.

61	   It should be noted that this document does not specify the behavior
62	   of a protocol that appears "on the wire".  It describes an operation
63	   that is to be applied to user input in order to prepare that user
64	   input for use in an "on the network" protocol.  As unusual as this
65	   may be for a document concerning Internet protocols, it is necessary
66	   to describe this operation for implementors who may have designed
67	   around the original IDNA protocol, which conflated this user input
68	   operation with the protocol.

70	   It is very important to note that there are many potential valid
71	   mappings of characters from user input.  The mapping described in
72	   this document is the basis for other mappings, and is not likely to
73	   be useful without modification.  Any useful mapping will have
74	   features designed to reduce the surprise for users and is likely to
75	   be slightly (or sometimes radically) different depending on the
76	   locale of the user, the type of input being used (such as typing,
77	   copy-and-paste, voice, and so on), the type of application used, etc.
78	   Although most common mappings will probably produce similar results
79	   for the same input, there will be subtle differences between
80	   applications.

82	1.1.  The Dividing Line between User Interface and Protocol

84	   The user interface to applications is much more complicated than most
85	   network implementers think.  When we say "the user enters an
86	   internationalized domain name in the application", we are talking
87	   about a very complex process that encompasses everything from the
88	   user formulating the name and deciding which symbols to use to
89	   express that name, to the user entering the symbols into the computer
90	   using some input method (be it a keyboard, a stylus, or even a voice
91	   recognition program), to the computer interpreting that input (be it
92	   keyboard scan codes, a graphical representation, or digitized sounds)
93	   into some representation of those symbols, through finally
94	   normalizing those symbols into a particular character repertoire in
95	   an encoding recognizable to IDNA processes and the domain name
96	   system.

98	   Consideration for user interface for internationalized domain names
99	   involves taking into account culture, context, and locale for any
100	   given user.  A simple and well-known example is the lowercasing of
101	   the Latin capital letter I (U+0049) when it is used in Turkish and
102	   other languages.  A capital "I" in Turkish is properly
103	   lowercased to a lowercase dotless "i" (U+0131), not to a Latin small
104	   letter i (U+0069).  This lowercasing is clearly dependent on the
105	   locale of the system and/or the locale of the user.  Using a single
106	   context-free mapping without considering the user interface
107	   properties has the potential of doing exactly the wrong thing for the
108	   user.
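
   The Turkish example above can be sketched in a few lines of Python.
   This is only an illustration and not part of any IDNA specification:
   the function name lowercase_for_locale and the use of two-letter
   language tags are assumptions made for this sketch, and Python's
   built-in str.lower() stands in for a context-free lowercasing.

```python
# Hypothetical helper, not defined by this document or by IDNA2008.
# Languages in which "I" lowercases to dotless i (assumed tag set).
TURKIC_LANGUAGES = {"tr", "az"}


def lowercase_for_locale(text, language):
    """Lowercase text, honouring the dotted/dotless i distinction."""
    if language in TURKIC_LANGUAGES:
        # U+0049 (I) -> U+0131 (dotless i)
        # U+0130 (I with dot above) -> U+0069 (i)
        text = text.translate({0x0049: 0x0131, 0x0130: 0x0069})
    # Context-free lowercasing for everything else.
    return text.lower()
```

   With this sketch, the same input "TITLE" lowercases to "title" for
   the language tag "en" but to a string containing U+0131 for "tr",
   which is why a single context-free mapping can surprise the user.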

110	   The original version of IDNA conflated user interface processing and
111	   protocol.  It took whatever characters the user produced in whatever
112	   encoding the application used, assumed some conversion to Unicode
113	   code points, and then without regard to context, locale, or anything
114	   about the user's intentions, mapped them into a particular set of
115	   other characters, and then re-encoded them in Punycode, in order to
116	   have the entire operation be contained within the protocol.  Ignoring
117	   context, locale, and user preference in the IDNA protocol made life
118	   significantly less complicated for the application developer, but at
119	   the expense of violating the principle of "least user surprise" for
120	   consumers and producers of domain names.

122	   In IDNA2008, the dividing line between "user interface" and
123	   "protocol" is clear.  The IDNA2008 specification defines the protocol
124	   part of IDNA: it explicitly does not deal with the user interface.
125	   Mappings such as the one described in this document explicitly deal
126	   with the user interface and not the protocol.  That is, a mapping is
127	   only to be applied before a string of characters is treated as a
128	   domain name (in the "user interface") and is never to be applied
129	   during domain name processing (in the "protocol").

131	1.2.  The Design of this Mapping

133	   The user interface mapping in this document is a set of expansions to
134	   IDNA2008 that are meant to be sensible and friendly and mostly
135	   obvious to people throughout the world when using typical
136	   applications with domain names that are entered by hand.  It is also
137	   designed to let applications be mostly backwards compatible with
138	   IDNA2003.  By definition, it cannot meet all of those design goals
139	   for all people, and in fact is known to fail on some of those goals
140	   for quite large populations of people.

142	   A good mapping in the real world might use the "sensible and friendly
143	   and mostly obvious" design goal but come up with a different
144	   algorithm.  Many algorithms will have results that are close to what
145	   is described here, but will differ in assumptions about the users'
146	   way of thinking or typing.  Having said that, it is likely that some
147	   mappings will be significantly different.  For example, a mapping
148	   might apply to a spoken user interface instead of a typed one.
149	   Another example is that a mapping might be different for users typing
150	   than for users using copy-and-paste from different applications.  Yet
151	   another example is that a user interface that allows typed input that
152	   is transliterated from Latin characters could have very different
153	   mappings than one that applies to typing in other character sets;
154	   this would be typical in a Pinyin input method for Chinese
155	   characters.

157	2.  The General Procedure

159	   This section defines a general algorithm that applications ought to
160	   implement in order to produce Unicode code points that will be valid
161	   under the IDNA protocol.  An application might implement the full
162	   mapping as described below, or can choose a different mapping.  This
163	   mapping is very general and was designed to be acceptable to the
164	   widest user community, but as stated above, it does not take into
165	   account any particular context, culture, or locale.

167	   The general algorithm that an application (or the input method
168	   provided by an operating system) ought to use is relatively
169	   straightforward:

171	   1.  Upper case characters are mapped to their lower case equivalents
172	       by using the algorithm for mapping case in Unicode characters.
173	       This step was chosen because the output will behave more like
174	       ASCII host names behave.

176	   2.  Full-width and half-width characters (those defined with
177	       Decomposition Types <wide> and <narrow>) are mapped to their
178	       decomposition mappings as shown in the Unicode character
179	       database.  This step was chosen because many input mechanisms,
180	       particularly in Asia, do not allow you to easily enter characters
181	       in the form used by IDNA2008.  Even if they do allow the correct
182	       character form, the user might not know which form they are
183	       entering.

185	   3.  All characters are mapped using Unicode Normalization Form C
186	       (NFC).  This step was chosen because it maps combinations of
187	       combining characters into canonical composed form.  As with the
188	       full-width/half-width mapping, users are not generally aware of
189	       the particular form of characters that they are entering, and
190	       IDNA2008 requires that only the canonical composed forms from NFC
191	       are used.

193	   4.  [IDNA2008protocol] is specified such that the protocol acts on
194	       the individual labels of the domain name.  If an implementation
195	       of this mapping is also performing the step of separation of the
196	       parts of a domain name into labels by using the FULL STOP
197	       character (U+002E), the IDEOGRAPHIC FULL STOP (U+3002) character
198	       can be mapped to the FULL STOP before label separation occurs.
199	       There are other characters that are used as "full stops" that one
200	       could consider mapping as label separators, but their use as such
201	       has not been investigated thoroughly.  This step was chosen
202	       because some input mechanisms do not allow the user to easily
203	       enter proper label separators.  Only the IDEOGRAPHIC FULL STOP
204	       (U+3002) character is added in this mapping because the authors
205	       have not fully investigated the applicability of other characters
206	       and the environments where they should and should not be
207	       considered domain name label separators.

209	   Note that the steps above are ordered.
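
   The four ordered steps can be sketched with Python's standard
   unicodedata module.  This is a minimal illustration under stated
   assumptions, not a reference implementation: the function names
   map_width and map_user_input are invented for this sketch,
   str.lower() is used as a stand-in for the lowercasing step, and the
   Unicode tables are whichever version the Python runtime ships, which
   is typically later than Unicode 5.2.

```python
import unicodedata

FULL_STOP = "\u002e"
IDEOGRAPHIC_FULL_STOP = "\u3002"


def map_width(ch):
    """Step 2 for one character: apply <wide>/<narrow> decompositions."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith(("<wide>", "<narrow>")):
        # The value looks like "<wide> 0041": a tag, then hex code points.
        _tag, *code_points = decomposition.split()
        return "".join(chr(int(cp, 16)) for cp in code_points)
    return ch


def map_user_input(text):
    """Apply the four steps in order and return the list of labels."""
    text = text.lower()                                # step 1
    text = "".join(map_width(ch) for ch in text)       # step 2
    text = unicodedata.normalize("NFC", text)          # step 3
    text = text.replace(IDEOGRAPHIC_FULL_STOP, FULL_STOP)  # step 4
    return text.split(FULL_STOP)
```

   The ordering is visible in this sketch: HALFWIDTH IDEOGRAPHIC FULL
   STOP (U+FF61) only becomes a label separator because step 2 first
   maps it to U+3002, which step 4 then maps to U+002E.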

211	   Definitions for the rules in this algorithm can be found in
212	   [Unicode52].  Specifically:

214	   o  Unicode Normalization Form C can be found in Annex #15 of
215	      [Unicode52].

217	   o  In order to map upper case characters to their lower case
218	      equivalents (defined in section 3.13 of [Unicode52]), first map
219	      characters to the "Lowercase_Mapping" property (the "<lower>"
220	      entry in the second column) in
221	      <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>, if any.
222	      Then, map characters to the "Simple_Lowercase_Mapping" property
223	      (the fourteenth column) in
224	      <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

226	   o  In order to map full-width and half-width characters to their
227	      decomposition mappings, map any character whose
228	      "Decomposition_Type" (contained in the first part of the sixth
229	      column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
230	      is either "<wide>" or "<narrow>" to the "Decomposition_Mapping" of
231	      that character (contained in the second part of the sixth column)
232	      in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>.

234	   o  The Unicode Character Database [TR44] has useful descriptions of
235	      the contents of these files.
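
   For implementations that read the data files directly, the column
   positions named above can be sketched as follows.  The function name
   parse_unicode_data_line and the dictionary keys are assumptions made
   for this illustration; the sample line is written in the
   semicolon-separated format of UnicodeData.txt and should be checked
   against the file for the Unicode version in use.

```python
# A line in the format of UnicodeData.txt (fields separated by ";").
SAMPLE_LINE = "FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;"


def parse_unicode_data_line(line):
    """Extract the fields used by the mapping from one data line."""
    fields = line.strip().split(";")
    # Sixth column: optional "<type>" tag followed by hex code points.
    parts = fields[5].split()
    decomposition_type = None
    if parts and parts[0].startswith("<"):
        decomposition_type = parts.pop(0)
    decomposition_mapping = "".join(chr(int(cp, 16)) for cp in parts) or None
    # Fourteenth column: Simple_Lowercase_Mapping, possibly empty.
    simple_lowercase = chr(int(fields[13], 16)) if fields[13] else None
    return {
        "code_point": int(fields[0], 16),
        "decomposition_type": decomposition_type,
        "decomposition_mapping": decomposition_mapping,
        "simple_lowercase": simple_lowercase,
    }
```

   Applied to SAMPLE_LINE, this sketch reports a "<wide>" decomposition
   to U+0041 and a simple lowercase mapping to U+FF41.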

237	   If the mappings in this document are applied to versions of Unicode
238	   later than Unicode 5.2, the later versions of the Unicode Standard
239	   should be consulted.
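
   An implementation that relies on a runtime library for its Unicode
   tables may want to check which version those tables reflect.  A
   minimal sketch in Python, where the function names are assumptions
   made for this illustration and unidata_version is the attribute
   exposed by the standard unicodedata module:

```python
import unicodedata


def unicode_version():
    """Return the runtime's Unicode table version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def is_later_than_unicode_5_2():
    """True when the runtime's tables are newer than Unicode 5.2.0."""
    return unicode_version() > (5, 2, 0)
```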

241	   These form a minimal set of mappings that an application should
242	   strongly consider doing.  Of course, there are many others that might
243	   be done.

245	3.  Implementing This Mapping

247	   If you are implementing a mapping for an application or operating
248	   system by using exactly the four steps in Section 2, the authors of
249	   this document have a request: please don't.  We mean it.  Section 2
250	   does not describe a universal mapping algorithm because, as we said,
251	   there is no universally-applicable mapping algorithm.

253	   If you read the material in Section 2 without reading Section 1, go
254	   back and carefully read all of Section 1; in many ways, Section 1 is
255	   more important than Section 2.  Further, you can probably think of
256	   user interface considerations that we did not list in Section 1.  If
257	   you did read Section 1 but somehow decided that the algorithm in
258	   Section 2 is completely correct for the intended users of your
259	   application or operating system, you are probably not thinking hard
260	   enough about your intended users.

262	4.  IANA Considerations

264	   This memo includes no request to IANA.

266	5.  Security Considerations

268	   This document suggests creating mappings that might cause confusion
269	   for some users while alleviating confusion for other users.  Such
270	   confusion is not covered in any depth in this document (nor in the
271	   other IDNA-related documents).

273	6.  Acknowledgements

275	   This document is the product of many contributions from numerous
276	   people in the IETF.

278	7.  Normative References

280	   [IDNA2008protocol]
281	              Klensin, J., "Internationalized Domain Names in
282	              Applications (IDNA): Protocol",
283	              draft-ietf-idnabis-protocol (work in progress),
284	              January 2010.

286	   [TR44]     The Unicode Consortium, "Unicode Character Database",
287	              Unicode Standard Annex 44, 2009.

289	   [Unicode52]
290	              The Unicode Consortium, "The Unicode Standard, Version
291	              5.2.0", 2009.

293	              defined by: The Unicode Standard, Version 5.0, Boston, MA,
294	              Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
295	              Unicode 5.2.0
296	              (<http://www.unicode.org/versions/Unicode5.2.0/>).

298	Authors' Addresses

300	   Peter W. Resnick
301	   Qualcomm Incorporated
302	   5775 Morehouse Drive
303	   San Diego, CA  92121-1714
304	   US

306	   Phone: +1 858 651 4478
307	   Email: presnick@qualcomm.com
308	   URI:   http://www.qualcomm.com/~presnick/

310	   Paul Hoffman
311	   VPN Consortium
312	   127 Segre Place
313	   Santa Cruz, CA  95060
314	   US

316	   Phone: 1-831-426-9827
317	   Email: paul.hoffman@vpnc.org