idnits 2.17.1
draft-klensin-idnabis-issues-05.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 15.
-- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
line 1865.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1876.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1883.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1889.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust Copyright Line does not match the
current year
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (November 18, 2007) is 6003 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
== Unused Reference: 'RFC2119' is defined on line 1752, but no explicit
reference was found in the text
== Unused Reference: 'Unicode32' is defined on line 1785, but no explicit
reference was found in the text
== Unused Reference: 'Unicode40' is defined on line 1796, but no explicit
reference was found in the text
== Unused Reference: 'RFC3986' is defined on line 1828, but no explicit
reference was found in the text
-- Possible downref: Non-RFC (?) normative reference: ref. 'ASCII'
-- Possible downref: Non-RFC (?) normative reference: ref. 'IDNA200X-Bidi'
== Outdated reference: A later version (-05) exists of
draft-faltstrom-idnabis-tables-03
== Outdated reference: A later version (-04) exists of
draft-klensin-idnabis-protocol-01
-- Possible downref: Normative reference to a draft: ref.
'IDNA200X-protocol'
** Obsolete normative reference: RFC 3454 (Obsoleted by RFC 7564)
** Obsolete normative reference: RFC 3490 (Obsoleted by RFC 5890, RFC 5891)
** Obsolete normative reference: RFC 3491 (Obsoleted by RFC 5891)
** Downref: Normative reference to an Informational RFC: RFC 3743
** Downref: Normative reference to an Informational RFC: RFC 4290
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode-UAX15'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode40'
-- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode50'
-- Obsolete informational reference (is this intentional?): RFC 810
(Obsoleted by RFC 952)
Summary: 6 errors (**), 0 flaws (~~), 9 warnings (==), 15 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group J. Klensin, Ed.
3 Internet-Draft November 18, 2007
4 Expires: May 21, 2008
6 Internationalizing Domain Names for Applications (IDNA): Issues and
7 Rationale
8 draft-klensin-idnabis-issues-05.txt
10 Status of this Memo
12 By submitting this Internet-Draft, each author represents that any
13 applicable patent or other IPR claims of which he or she is aware
14 have been or will be disclosed, and any of which he or she becomes
15 aware will be disclosed, in accordance with Section 6 of BCP 79.
17 Internet-Drafts are working documents of the Internet Engineering
18 Task Force (IETF), its areas, and its working groups. Note that
19 other groups may also distribute working documents as Internet-
20 Drafts.
22 Internet-Drafts are draft documents valid for a maximum of six months
23 and may be updated, replaced, or obsoleted by other documents at any
24 time. It is inappropriate to use Internet-Drafts as reference
25 material or to cite them other than as "work in progress."
27 The list of current Internet-Drafts can be accessed at
28 http://www.ietf.org/ietf/1id-abstracts.txt.
30 The list of Internet-Draft Shadow Directories can be accessed at
31 http://www.ietf.org/shadow.html.
33 This Internet-Draft will expire on May 21, 2008.
35 Copyright Notice
37 Copyright (C) The IETF Trust (2007).
39 Abstract
41 A recent IAB report identified issues that have been raised with
42 Internationalized Domain Names (IDNs). Some of these issues require
43 tuning of the existing protocols and the tables on which they depend.
44 Based on intensive discussion by an informal design team, this
45 document provides an overview of some of the proposals that are being
46 made, provides explanatory material for them and then further
47 explains some of the issues that have been encountered.
49 Table of Contents
51 1. Introduction
52 1.1. Context and Overview
53 1.2. Discussion Forum
54 1.3. Objectives
55 1.4. Applicability and Function of IDNA
56 1.5. Terminology
57 1.5.1. Documents and Standards
58 1.5.2. Terminology about Characters and Character Sets
59 1.5.3. DNS-related Terminology
60 1.5.4. Terminology Specific to IDNA
61 1.5.5. Punycode is an Algorithm, not a Name
62 1.5.6. Other Terminology Issues
63 2. The Original (2003) IDNA Model
64 2.1. Proposed label
65 2.2. Permitted Character Identification
66 2.3. Character Mappings
67 2.4. Registry Restrictions
68 2.5. Punycode Conversion
69 2.6. Lookup or Insertion in the Zone
70 3. A Revised IDNA Model
71 3.1. Localization: The Role of the Local System and User
72 Interface
73 3.2. IDN Processing in the IDNA200x Model
74 3.2.1. Summary of Effects
75 4. IDNA200x Document List
76 5. Permitted Characters: An Inclusion List
77 5.1. A Tiered Model of Permitted Characters and Labels
78 5.1.1. ALWAYS
79 5.1.2. MAYBE
80 5.1.3. CONTEXTUAL RULE REQUIRED
81 5.1.4. NEVER
82 5.2. Layered Restrictions: Tables, Context, Registration,
83 Applications
84 5.3. A New Character List -- History
85 5.4. Understanding New Issues and Constraints
86 5.5. ALWAYS, MAYBE, and Contextual Rules
87 6. Issues that Any Solution Must Address
88 6.1. Display and Network Order
89 6.2. Entry and Display in Applications
90 6.3. The Ligature and Digraph Problem
91 6.4. Right-to-left Text
92 7. IDNs and the Robustness Principle
93 8. Migration and Version Synchronization
94 8.1. Design Criteria
95 8.2. More Flexibility in User Agents
96 8.3. The Question of Prefix Changes
97 8.3.1. Conditions requiring a prefix change
98 8.3.2. Conditions not requiring a prefix change
99 8.4. Stringprep Changes and Compatibility
100 8.5. The Symbol Question
101 8.6. Other Compatibility Issues
102 9. Acknowledgments
103 10. Contributors
104 11. IANA Considerations
105 11.1. IDNA Permitted Character Registry
106 11.2. IDNA Context Registry
107 11.3. IANA Repository of TLD IDN Practices
108 12. Security Considerations
109 13. Change Log
110 13.1. Version -01
111 13.2. Version -02
112 13.3. Version -03
113 13.4. Version -04
114 13.5. Version -05
115 14. References
116 14.1. Normative References
117 14.2. Informative References
118 Author's Address
119 Intellectual Property and Copyright Statements
121 1. Introduction
123 1.1. Context and Overview
125 A recent IAB report [RFC4690] identified issues that have been raised
126 with Internationalized Domain Names (IDNs) and the associated
127 standards. Those standards are known as Internationalized Domain
128 Names in Applications (IDNA), taken from the name of the highest
129 level standard within that group (see Section 1.5). Based on
130 discussion of those issues and their impact, some of these standards
131 now require tuning the existing protocols and the tables on which
132 they depend. Based on the results of intensive discussions by an
133 informal design team, on a mailing list, and in broader forums, this
134 document further explains some of the issues that have been
135 encountered. It also provides an overview of the proposals that are
136 being made and explanatory material for them. Additional explanatory
137 material for other proposals will appear with the associated
138 documents.
140 This document begins with a discussion of the original and new IDNA
141 models and the general differences in strategy between the original
142 version of IDNA and the proposed new version. It continues with a
143 description of specific changes that are needed and issues that the
144 design must address, including some that were not explicitly
145 addressed in RFC 4690.
147 1.2. Discussion Forum
149 [[anchor4: RFC Editor: please remove this section.]]
151 This work is being discussed on the mailing list
152 idna-update@alvestrand.no
154 1.3. Objectives
156 The intent of the IDNA revision effort, and hence of this document
157 and the associated ones, is to increase the usability and
158 effectiveness of internationalized domain names (IDNs) while
159 preserving or strengthening the integrity of references that use
160 them. The original "hostname" (LDH) character definitions (see,
161 e.g., [RFC0810]) struck a balance between the creation of useful
162 mnemonics and the introduction of parsing problems or general
163 confusion in the contexts in which domain names are used. Our
164 objective is to preserve that balance while expanding the character
165 repertoire to include extended versions of Roman-derived scripts and
166 scripts that are not Roman in origin. No work of this sort will be
167 able to completely eliminate sources of visual or textual confusion:
168 such confusion exists even under the original rules. However, one
169 can hope, through the application of different techniques at
170 different points (see Section 5.2), to keep problems to an acceptable
171 minimum. One consequence of this general objective is that the
172 desire of some user or marketing community to use a particular string
173 --whether the reason is to try to write sentences of particular
174 languages in the DNS, to express a facsimile of the symbol for a
175 brand, or for some other purpose-- is not a primary goal or even a
176 particularly important one.
178 1.4. Applicability and Function of IDNA
180 The IDNA standard does not require any applications to conform to it,
181 nor does it retroactively change those applications. An application
182 can elect to use IDNA in order to support IDN while maintaining
183 interoperability with existing infrastructure. If an application
184 wants to use non-ASCII characters in domain names, IDNA is the only
185 currently-defined option. Adding IDNA support to an existing
186 application entails changes to the application only, and leaves room
187 for flexibility in the user interface.
189 A great deal of the discussion of IDN solutions has focused on
190 transition issues and how IDN will work in a world where not all of
191 the components have been updated. Proposals that were not chosen by
192 the original IDN Working Group would depend on user applications,
193 resolvers, and DNS servers being updated in order for a user to use
194 an internationalized domain name in any form or coding acceptable
195 under that method. While processing must be performed prior to or
196 after access to the DNS, no changes are needed to the DNS protocol or
197 any DNS servers or the resolvers on users' computers.
199 The IDNA specification solves the problem of extending the repertoire
200 of characters that can be used in domain names to include a large
201 subset of the Unicode repertoire.
203 IDNA does not extend the service offered by DNS to the applications.
204 Instead, the applications (and, by implication, the users) continue
205 to see an exact-match lookup service. Either there is a single
206 exactly-matching name or there is no match. This model has served
207 the existing applications well, but it requires, with or without
208 internationalized domain names, that users know the exact spelling of
209 the domain names that are to be typed into applications such as web
210 browsers and mail user agents. The introduction of the larger
211 repertoire of characters potentially makes the set of misspellings
212 larger, especially given that in some cases the same appearance, for
213 example on a business card, might visually match several Unicode code
214 points or several sequences of code points.
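As a small illustration of the last point (a sketch only, not part of
any IDNA specification), the following Python fragment shows two code
point sequences that are normally rendered identically but that do not
compare equal as strings until they are normalized. The variable names
are chosen for this example.

```python
import unicodedata

# Two ways of writing what is normally displayed as "e with acute":
precomposed = "\u00e9"    # a single code point, U+00E9
decomposed = "e\u0301"    # U+0065 followed by combining acute U+0301

# An exact-match comparison treats them as different strings.
same_raw = precomposed == decomposed

# After canonical composition (NFC) they become the same sequence.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

An exact-match lookup service sees only the code point sequences, so
any reconciliation of this kind has to happen before the lookup.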
216 IDNA allows the graceful introduction of IDNs not only by avoiding
217 upgrades to existing infrastructure (such as DNS servers and mail
218 transport agents), but also by allowing some rudimentary use of IDNs
219 in applications by using the ASCII representation of the non-ASCII
220 name labels. While such names are user-unfriendly to read and type,
221 and hence not optimal for user input, they allow (for instance)
222 replying to email and clicking on URLs even though the domain name
223 displayed is incomprehensible to the user. In order to allow user-
224 friendly input and output of the IDNs, the applications need to be
225 modified to conform to this specification.
227 IDNA uses the Unicode character repertoire, which avoids the
228 significant delays that would be inherent in waiting for a different
229 and specific character set to be defined for IDN purposes, presumably by
230 some other standards developing organization.
232 1.5. Terminology
234 1.5.1. Documents and Standards
236 This document uses the term "IDNA2003" to refer to the set of
237 standards that make up and support the version of IDNA published in
238 2003, i.e., those commonly known as the IDNA base specification
239 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep
240 [RFC3454]. In this document, those names are used to refer,
241 conceptually, to the individual documents, with the base IDNA
242 specification called just "IDNA".
244 The term "IDNA200x" is used to refer to a possible new version of
245 IDNA without specifying which particular documents would be affected.
246 While more common IETF usage might refer to the successor document(s)
247 as "IDNAbis", this document uses that term, and similar ones, to
248 refer to successors to the individual documents, e.g., "IDNAbis" is a
249 synonym for the specific successor to RFC3490, or "RFC3490bis". See
250 also Section 4.
252 1.5.2. Terminology about Characters and Character Sets
254 A code point is an integer value associated with a character in a
255 coded character set.
257 Unicode [Unicode50] is a coded character set containing tens of
258 thousands of characters. A single Unicode code point is denoted by
259 "U+" followed by four to six hexadecimal digits, while a range of
260 Unicode code points is denoted by two hexadecimal numbers separated
261 by "..", with no prefixes.
263 ASCII means US-ASCII [ASCII], a coded character set containing 128
264 characters associated with code points in the range 00..7F. Unicode
265 may be thought of as an extension of ASCII: it includes all the ASCII
266 characters and associates them with equivalent code points.
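The notation described above can be sketched in a few lines of Python.
The helper names u_notation and is_ascii_code_point are hypothetical
and exist only for this illustration.

```python
def u_notation(ch):
    """Return the "U+" notation for a single character.

    At least four hexadecimal digits are produced; code points above
    U+FFFF naturally yield five or six digits.
    """
    return "U+%04X" % ord(ch)


def is_ascii_code_point(ch):
    """True if the character's code point lies in the range 00..7F."""
    return 0x00 <= ord(ch) <= 0x7F
```

For example, u_notation("a") gives "U+0061", and the same integer
value identifies the letter in both ASCII and Unicode.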
268 1.5.3. DNS-related Terminology
270 When discussing the DNS, this document generally assumes the
271 terminology used in the DNS specifications [RFC1034] [RFC1035]. The
272 terms "lookup" and "resolution" are used interchangeably and the
273 process or application component that performs DNS resolution is
274 called a "resolver". The process of placing an entry into the DNS is
275 referred to as "registration" paralleling common contemporary usage
276 in other contexts.
278 The term "LDH code points" is defined in this document to mean the
279 code points associated with ASCII letters, digits, and the hyphen-
280 minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an
281 abbreviation for "letters, digits, hyphen".
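A minimal sketch of a membership test for the LDH code points, using
exactly the ranges listed above. The names LDH_RANGES,
is_ldh_code_point, and is_ldh_string are assumptions made for this
example.

```python
# Hyphen-minus, digits, upper case letters, lower case letters.
LDH_RANGES = [(0x2D, 0x2D), (0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A)]


def is_ldh_code_point(ch):
    """True if ch is an ASCII letter, digit, or hyphen-minus."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LDH_RANGES)


def is_ldh_string(s):
    """True if s is non-empty and consists only of LDH code points."""
    return len(s) > 0 and all(is_ldh_code_point(ch) for ch in s)
```

This checks the character repertoire only; it does not check label
length or the position of hyphens.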
283 The base DNS specifications [RFC1034] [RFC1035] discuss "domain
284 names" and "host names", but many people and sections of these
285 specifications use the terms interchangeably. Further, because those
286 documents were not terribly clear, many people who are sure they know
287 the exact definitions of each of these terms disagree on the
288 definitions. In this document the term "domain name" is used in
289 general. This document explicitly cites those documents whenever
290 referring to the host name syntax restrictions defined therein. The
291 remaining definitions in this subsection are essentially a review.
293 A label is an individual part of a domain name. Labels are usually
294 shown separated by dots; for example, the domain name
295 "www.example.com" is composed of three labels: "www", "example", and
296 "com". (The zero-length root label described in [RFC1123], which can
297 be explicit as in "www.example.com." or implicit as in
298 "www.example.com", is not considered a label in this specification.)
299 IDNA extends the set of usable characters in labels that are text.
300 For the rest of this document, the term "label" is shorthand for
301 "text label", and "every label" means "every text label".
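The division of a domain name into labels can be sketched as follows.
The function name split_labels is hypothetical, and the sketch assumes
the name is given in the usual dotted presentation form with no
escaped dots.

```python
def split_labels(domain_name):
    """Split a dotted domain name into its text labels.

    A single trailing dot (the explicit root) is dropped, so the
    zero-length root label is never returned.
    """
    if domain_name.endswith("."):
        domain_name = domain_name[:-1]
    return domain_name.split(".") if domain_name else []
```

With this sketch, "www.example.com" and "www.example.com." both yield
the three labels "www", "example", and "com".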
303 1.5.4. Terminology Specific to IDNA
305 Some of the terminology used in describing IDNs in the IDNA2003
306 context has been a source of confusion. This section defines some
307 new terminology to reduce dependence on the problematic terms and
308 definitions that appear in RFC 3490.
310 1.5.4.1. Terms for IDN Label Codings
312 1.5.4.1.1. IDNA-valid strings, A-label, and U-label
314 To improve clarity, this document introduces three new terms. A
315 string is "IDNA-valid" if it meets all of the requirements of this
316 specification for an IDNA label. It may be either an "A-label" or a
317 "U-label", and it is expected that specific reference will be made to
318 the form appropriate to any context in which the distinction is
319 important. An "A-label" is the ASCII-Compatible Encoding (ACE) form
320 of an IDNA-valid string. It must be a complete label and valid as
321 the output of ToASCII, regardless of how it is actually produced.
322 This means, by definition, that every A-label will begin with the
323 IDNA ACE prefix, "xn--", followed by a string that is a valid output
324 of the Punycode algorithm and hence a maximum of 59 ASCII characters
325 in length. The prefix and string together must conform to all
326 requirements for a label that can be stored in the DNS including
327 conformance to the LDH rule. A "U-label" is an IDNA-valid string of
328 Unicode-coded characters that is a valid output of performing
329 ToUnicode on an A-label, again regardless of how the label is
330 actually produced. A Unicode string that cannot be generated by
331 decoding a valid A-label is not a valid U-label. [IDNA200X-protocol]
332 specifies the conversions between U-labels and A-labels.
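The relationship between the two forms can be illustrated with the
Punycode codec in the Python standard library. This is a sketch of the
encoding step only: it assumes the Unicode input is already an
IDNA-valid string, and it performs none of the character checks that
the protocol document requires. The function names are hypothetical.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Encode a (presumed IDNA-valid) Unicode label as an A-label."""
    if u_label.isascii():
        raise ValueError("an all-ASCII label has no A-label form")
    a_label = ACE_PREFIX + u_label.encode("punycode").decode("ascii")
    if len(a_label) > 63:  # 4 prefix characters + at most 59 others
        raise ValueError("A-label too long for the DNS")
    return a_label


def to_u_label(a_label):
    """Decode an A-label back to its Unicode form."""
    a_label = a_label.lower()  # A-labels are ASCII, so this is safe
    if not a_label.startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
```

A string qualifies as one form only if converting it to the other form
and back reproduces it, which is the round-trip property the
definitions above rely on.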
334 Any rules or conventions that apply to DNS labels in general, such as
335 rules about lengths of strings, apply to whichever of the U-label or
336 A-label would be more restrictive. The exception to this, of course,
337 is that the restriction to ASCII characters does not apply to the
338 U-label.
340 A different way to look at these terms, which may be more clear to
341 some readers, is that U-labels, A-labels, and LDH-labels are disjoint
342 categories that, together, make up the forms of legitimate strings
343 for use in domain names that describe hosts. Of the three, only
344 A-labels and LDH-labels can actually appear in DNS zone files or
345 queries; U-labels can appear, along with those two, in presentation
346 and user interface forms and in selected protocols other than the DNS
347 ones themselves. Strings that do not conform to the rules for one of
348 these three categories and, in particular, strings that contain "-"
349 in the third or fourth character position but that
351 o are not A-labels or
353 o cannot be processed as U-labels or A-labels as described in
354 these specifications,
356 are invalid as labels in domain names that identify Internet hosts or
357 similar resources.
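The categories described above can be sketched as a rough classifier.
This is an illustration under several assumptions: it follows the
wording above literally (a hyphen in the third or fourth position), it
checks only the shape of an A-label via a Punycode round trip, it does
not validate non-ASCII labels at all, and it omits other hostname
rules such as those for leading and trailing hyphens. All names are
hypothetical.

```python
ACE_PREFIX = "xn--"
LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")


def looks_like_a_label(label):
    """Shape check: prefix, length, and a clean Punycode round trip."""
    lower = label.lower()
    if not lower.startswith(ACE_PREFIX) or len(lower) > 63:
        return False
    body = lower[len(ACE_PREFIX):]
    try:
        decoded = body.encode("ascii").decode("punycode")
        reencoded = decoded.encode("punycode").decode("ascii")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False
    return (not decoded.isascii()) and reencoded == body


def classify(label):
    if not label:
        return "invalid"
    if not label.isascii():
        return "U-label candidate"      # full validation not attempted
    if not set(label.lower()) <= LDH:
        return "other"                  # e.g., labels used with SRV
    if looks_like_a_label(label):
        return "A-label (shape only)"
    if "-" in label[2:4]:
        return "invalid"
    return "LDH-label"
```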
359 1.5.4.1.2. LDH-label and Internationalized Label
361 In the hope of further clarifying discussions about IDNs, this
362 document uses the term "LDH-label" strictly to refer to an all-ASCII
363 label that obeys the "hostname" (LDH) conventions and that is not an
364 IDN. In other words, the categories "U-label", "A-label", and "LDH-
365 label" are disjoint, with only the first two referring to IDNs. When
366 such a term is needed, an "internationalized label" is one that is a
367 member of the union of those three categories. There are some
368 standardized DNS label formats, such as those for service location
369 (SRV) records [RFC2782] that do not fall into any of the three
370 categories and hence are not internationalized labels.
372 1.5.4.2. Equivalence
374 In IDNA, equivalence of labels is defined in terms of the A-labels.
375 If the A-labels are equal in a case-independent comparison, then the
376 labels are considered equivalent, no matter how they are represented.
377 Traditional LDH labels already have a notion of equivalence: within
378 that list of characters, upper case and lower case are considered
379 equivalent. The IDNA notion of equivalence is an extension of that
380 older notion. Equivalent labels in IDNA are treated as alternate
381 forms of the same label, just as "foo" and "Foo" are treated as
382 alternate forms of the same label.
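A minimal sketch of this comparison, assuming both inputs are already
valid labels. No character mapping is applied, so the sketch does not
treat upper and lower case non-ASCII letters as equivalent. The
function names are hypothetical.

```python
def a_label_form(label):
    """Return the lower-cased ASCII form used for comparison."""
    if label.isascii():
        return label.lower()
    return ("xn--" + label.encode("punycode").decode("ascii")).lower()


def equivalent(label1, label2):
    """Compare two labels through their A-label (or LDH) forms."""
    return a_label_form(label1) == a_label_form(label2)
```

Under this sketch, "foo" and "Foo" are equivalent, and so are a
U-label and the A-label that encodes it.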
384 1.5.4.3. ACE prefix
386 The "ACE prefix" is defined in this document to be a string of ASCII
387 characters "xn--" that appears at the beginning of every A-label.
388 "ACE" stands for "ASCII-Compatible Encoding".
390 1.5.4.4. Domain name slot
392 A "domain name slot" is defined in this document to be a protocol
393 element or a function argument or a return value (and so on)
394 explicitly designated for carrying a domain name. Examples of domain
395 name slots include: the QNAME field of a DNS query; the name argument
396 of the gethostbyname() library function; the part of an email address
397 following the at-sign (@) in the From: field of an email message
398 header; and the host portion of the URI in the src attribute of an
399 HTML <IMG> tag. General text that just happens to contain a domain
400 name is not a domain name slot. For example, a domain name appearing
401 in the plain text body of an email message is not occupying a domain
402 name slot.
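Two of the slots named above can be located with the Python standard
library, as in the sketch below. The function names are hypothetical,
and the address handling is deliberately naive: it takes whatever
follows the last at-sign and does not parse full message header
syntax.

```python
from urllib.parse import urlsplit


def host_slot_of_uri(uri):
    """Return the host portion of a URI, or None if there is none."""
    return urlsplit(uri).hostname


def domain_slot_of_address(addr_spec):
    """Return the part of an address following the last at-sign."""
    _local, at, domain = addr_spec.rpartition("@")
    return domain if at else None
```

A domain name found by searching free text, by contrast, is not in a
slot, and code of this kind would have no defined place to look for
it.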
404 An "IDN-aware domain name slot" is defined in this document to be a
405 domain name slot explicitly designated for carrying an
406 internationalized domain name as defined in this document. The
407 designation may be static (for example, in the specification of the
408 protocol or interface) or dynamic (for example, as a result of
409 negotiation in an interactive session).
411 An "IDN-unaware domain name slot" is defined in this document to be
412 any domain name slot that is not an IDN-aware domain name slot.
413 Obviously, this includes any domain name slot whose specification
414 predates IDNA.
416 1.5.5. Punycode is an Algorithm, not a Name
418 There has been some confusion about whether a "Punycode string" does
419 or does not include the prefix and about whether it is required that
420 such strings could have been the output of ToASCII (see RFC 3490,
421 Section 4 [RFC3490]). This specification discourages the use of the
422 term "Punycode" to describe anything but the encoding method and
423 algorithm of [RFC3492]. The terms defined above are preferred as
424 much more clear than terms such as "Punycode string".
426 1.5.6. Other Terminology Issues
428 The document departs from historical DNS terminology and usage in one
429 important respect. Over the years, the community has talked very
430 casually about "names" in the DNS, beginning with calling it "the
431 domain name system". That terminology is fine in the very precise
432 sense that the identifiers of the DNS do provide names for objects
433 and addresses. But, in the context of IDNs, the term has introduced
434 some confusion, confusion that has increased further as people have
435 begun to speak of DNS labels in terms of the words or phrases of
436 various natural languages.
438 Historically, many, perhaps most, of the "names" in the DNS have just
439 been mnemonics to identify some particular concept, object, or
440 organization. They are typically derived from, or rooted in, some
441 language because most people think in language-based ways. But,
442 because they are mnemonics, they need not obey the orthographic
443 conventions of any language: it is not a requirement that it be
444 possible for them to be "words".
446 This distinction is important because the reasonable goal of an IDN
447 effort is not to be able to write the great Klingon (or language of
448 one's choice) novel in DNS labels but to be able to form a usefully
449 broad range of mnemonics in ways that are as natural as possible in a
450 very broad range of scripts.
452 An "internationalized domain name" (IDN) is a domain name that may
453 contain one or more A-labels or U-labels, as appropriate, instead of
454 LDH labels. This implies that every conventional domain name is an
455 IDN (which implies that it is possible for a name to be an IDN
456 without it containing any non-ASCII characters). This document does
457 not attempt to define an "internationalized host name". Just as has
458 been the case with ASCII names, some DNS zone administrators may
459 impose restrictions, beyond those imposed by DNS or IDNA, on the
460 characters or strings that may be registered as labels in their
461 zones. Such restrictions have no effect on the syntax or semantics
462 of DNS protocol messages; a query for a name that matches no records
463 will yield the same response regardless of the reason why it is not
464 in the zone. Clients issuing queries or interpreting responses
465 cannot be assumed to have any knowledge of zone-specific restrictions
466 or conventions.
468 2. The Original (2003) IDNA Model
470 IDNA is a client-side protocol, i.e., almost all of the processing is
471 performed by the client. The strings that appear in, and are
472 resolved by, the DNS conform to the traditional rules for the naming
473 of hosts, and consist of ASCII letters, digits, and hyphens. This
474 approach permits IDNA to be deployed without modifications to the DNS
475 itself. That, in turn, avoids both having to upgrade the entire
476 Internet to support IDNs and needing to incur the unknown risks to
477 deployed systems of DNS structural or design changes, especially if
478 those changes need to be deployed all at the same time.
480 This section contains a summary of the model underlying IDNA2003. It
481 is approximate and is not a substitute for reading and understanding
482 the actual specification document [RFC3490] and the documents on
483 which it depends. The summary is not intended to be completely
484 balanced. It emphasizes some characteristics of IDNA2003 that are
485 particularly important to understanding the nature of the proposed
486 changes.
488 The original IDNA specifications have the logical flow in domain name
489 registration and resolution outlined in the balance of this section.
490 They are not defined this way; instead, the steps are presented here
491 for convenience in comparison to what is being proposed in this
492 document and the associated ones. In particular, IDNA2003 does not
493 make as strong a distinction between procedures for registration and
494 those for resolution as the ones suggested in Section 3 and
495 Section 5.1.
497 The IDNA2003 specification explicitly includes the equivalents of the
498 steps in Section 2.2, Section 2.3, and Section 2.5 below. While the
499 other steps are present --either inside the protocol or presumed to
500 be performed before or after it-- they are not discussed explicitly.
501 That omission has been a source of confusion. Another source has
502 been the definition of IDNA2003 as an algorithm, expressed partially in
503 prose and partially in pseudo code and tables. The steps below
504 follow the more traditional IETF practice: the functions are
505 specified, rather than the algorithms. The breakdown into steps is
506 for clarity of explanation; any implementation that produces the same
507 result with the same inputs is conforming.
509 2.1. Proposed label
511 The registrant submits a request for an IDN or the user attempts to
512 look up an IDN. The registrant or user typically produces the
513 request string by keyboard entry of a character sequence. That
514 sequence is validated only on the basis of its displayed appearance,
515 without knowledge of the character coding used for its internal
516 representation or other local details of the way the operating system
517 processes it. This string is converted to Unicode if necessary.
518 IDNA2003 assumes that the conversion is straightforward enough not to
519 be considered by the protocol.
521 2.2. Permitted Character Identification
523 The Unicode string is examined to prohibit characters that IDNA does
524 not permit in input. The list of excluded characters is quite
525 limited because IDNA2003 permits almost all Unicode characters to be
526 used as input, with many of them mapped into others.
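As a rough illustration of this step, the sketch below uses the
"stringprep" module from the Python standard library, which exposes
the RFC 3454 tables, to test a string against the tables that the
Nameprep profile prohibits. The names PROHIBITED_TABLES and
prohibited_characters are hypothetical. Note also that the sketch
examines raw input, whereas Nameprep itself applies its prohibition
check after mapping and normalization.

```python
import stringprep

# RFC 3454 tables that the Nameprep profile (RFC 3491) prohibits:
# C.1.2, C.2.2, and C.3 through C.9.
PROHIBITED_TABLES = (
    stringprep.in_table_c12,
    stringprep.in_table_c22,
    stringprep.in_table_c3,
    stringprep.in_table_c4,
    stringprep.in_table_c5,
    stringprep.in_table_c6,
    stringprep.in_table_c7,
    stringprep.in_table_c8,
    stringprep.in_table_c9,
)


def prohibited_characters(label):
    """Return the characters of `label` that fall in a prohibited table."""
    return [ch for ch in label if any(t(ch) for t in PROHIBITED_TABLES)]


# An ordinary letter string passes; a private-use code point (C.3) does not.
print(prohibited_characters("b\u00fccher"))
print(prohibited_characters("a\ue000b"))
```

The short list of tables reflects the point made above: very little
is excluded outright, and most of the work is left to the mapping
step.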
528 2.3. Character Mappings
530 The label string is processed through the Nameprep [RFC3491] profile
531 of the Stringprep [RFC3454] tables and procedure. Among other
532 things, these procedures apply the Unicode normalization procedure
533 NFKC [Unicode-UAX15] which converts compatibility characters to their
534 base forms and resolves the different ways in which some characters
535 can be represented in Unicode into a canonical form. In IDNA2003,
536 one-way case mapping was also performed, partially simulating the
537 query-time folding operation that the DNS provides for ASCII strings.
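The effect of these mappings can be observed with the nameprep()
helper in Python's standard "encodings.idna" module, which implements
the IDNA2003 profile. The sample strings are illustrative only. The
mapping is one-way in the sense that the original case and
compatibility forms cannot be recovered from the output.

```python
from encodings.idna import nameprep

# Upper case is folded, the "fi" ligature (U+FB01) is replaced by its
# base letters, and U+00DF (sharp s) is mapped to "ss".
samples = ("B\u00dcCHER", "\ufb01sh", "Stra\u00dfe")

for raw in samples:
    print(repr(raw), "->", repr(nameprep(raw)))
```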
539 2.4. Registry Restrictions
541 Registries at all levels of the DNS, not just the top level, are
542 expected to establish policies about the labels that may be
543 registered and for the processes associated with that action (see the
544 discussion of guidelines and statements in [RFC4690]). Such
545 restrictions have always existed in the DNS and have always been
546 applied at registration time, with the most notable example being
547 enforcement of the hostname (LDH) convention itself. For IDNs, the
548 restrictions to be applied are not an IETF matter except insofar as
549 they derive from restrictions imposed by application protocols (e.g.,
550 email has always required a more restricted syntax for domain names
551 than the restrictions of the DNS itself). Because these are
552 restrictions on what can be registered, it is not generally necessary
553 that they be global. If a name is not found on resolution, it is not
554 relevant whether it could have been registered; only that it was not
555 registered. Registry restrictions might include prohibition of
556 mixed-script labels or restrictions on labels permitted in a zone if
557 certain other labels are already present. The "variant" systems
558 discussed in [RFC3743] and [RFC4290] are examples of fairly
559 sophisticated registry restriction models. The various sets of ICANN
560 IDN Guidelines [ICANN-Guidelines] also suggest restrictions that
561 might sensibly be imposed.
563 The string produced by the above steps is checked and processed as
564 appropriate to local registry restrictions. Application of those
565 registry restrictions may result in the rejection of some labels or
566 the application of special restrictions to others.
568 2.5. Punycode Conversion
570 The resulting label (in Unicode code point character form) is
571 processed with the Punycode algorithm [RFC3492] and converted to a
572 form suitable for storage in the DNS (the "xn--..." form).
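A minimal sketch of this conversion, using the "punycode" codec from
the Python standard library, follows. The function names to_a_label
and to_u_label are hypothetical, and the input is assumed to have
been through the mapping step already; no validation is attempted.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label):
    """Encode an already-mapped Unicode label in the "xn--..." form."""
    if u_label.isascii():
        return u_label  # an all-ASCII label is stored as it is
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label):
    """Decode an "xn--..." label back to Unicode code points."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(to_a_label("b\u00fccher"))
print(to_u_label("xn--bcher-kva"))
```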
574 2.6. Lookup or Insertion in the Zone
576 For registration, the Punycode-encoded label is then placed in the
577 DNS by insertion into a zone. For lookup, that label is processed
578 according to normal DNS query procedures [RFC1035].
580 3. A Revised IDNA Model
582 One of the major goals of this work is to improve the general
583 understanding of how IDNA works and what characters are permitted and
584 what happens to them. Comprehensibility and predictability to users
585 and registrants are themselves important motivations and design goals
586 for this effort. The effort includes some new terminology and a
587 revised and extended model, both covered in this section, and some
588 more specific protocol, processing, and table modifications. Details
589 of the latter appear in other documents (see Section 4).
591 3.1. Localization: The Role of the Local System and User Interface
593 Several issues are inherent in the application of IDNs and, indeed,
594 almost any other system that tries to handle international characters
595 and concepts. They range from the apparently trivial --e.g., one
596 cannot display a character for which one does not have a font
597 available locally-- to the more complex and subtle. Many people have
598 observed that internationalization is just a tool to permit effective
599 localization while permitting some global uniformity. Issues of
600 display, of exactly how various strings and characters are entered,
601 and so on are inherently issues about localization and user interface
602 design.
604 A protocol such as IDNA can only assume that such operations as data
605 entry are possible. It may make some recommendations about how
606 display might work when characters and fonts are not available, but
607 they can only be general recommendations.
609 Operations for converting between local character sets and Unicode
610 are part of this general set of user interface issues. The
611 conversion is obviously not required at all in a Unicode-native
612 system. It may, however, involve some complexity in a system that is
613 not Unicode-native, especially if the elements of the local character
614 set do not map into Unicode characters exactly, unambiguously, and in
615 a way that is completely stable over time.
616 Perhaps more important, if a label being converted to a local
617 character set contains Unicode characters that have no correspondence
618 in that character set, the application may have to apply special,
619 locally-appropriate, methods to avoid or reduce loss of information.
621 Depending on the system involved, the major difficulty may not lie in
622 the mapping but in accurately identifying the incoming character set
623 and then applying the correct conversion routine. It may be
624 especially difficult when the character coding system in local use is
625 based on conceptually different assumptions than those used by
626 Unicode about, e.g., how different presentation or combining forms
627 are handled. Those differences may not easily yield unambiguous
628 conversions or interpretations even if each coding system is
629 internally consistent and adequate to represent the local language
630 and script.
632 3.2. IDN Processing in the IDNA200x Model
634 [[anchor20: Placeholder ??? Do we need a summary of the two parts
635 here???]]
637 3.2.1. Summary of Effects
639 Separating Domain Name Registration and Resolution in the protocol
640 specification has one substantive impact. With IDNA2003, the tests
641 and steps made in these two parts of the protocol are essentially
642 identical. Separating them reflects current practice in which per-
643 registry restrictions and special processing are applied at
644 registration time but not on resolution. Even more important in the
645 longer term, it allows incremental addition of permitted character
646 groups to avoid freezing on one particular version of Unicode.
648 4. IDNA200x Document List
650 [[anchor22: This section will need to be extensively revised or
651 removed before publication.]]
653 The following documents are being produced as part of the IDNA200x
654 effort.
656 o A revised version of this document, containing an overview,
657 rationale, and conformance conditions.
659 o A separate document, drawn from material in early versions of this
660 one, that explicitly updates and replaces RFC 3490 but which has
661 most rationale material from that document moved to this one
662 [IDNA200X-protocol].
664 o A document describing the "Bidi problem" with Stringprep and
665 proposing a solution [IDNA200X-Bidi].
667 o A list of code points allowed in a U-label, based on Unicode 5.0
668 code assignments. See Section 5.
670 o One or more documents containing guidance and suggestions for
671 registries (in this context, those responsible for establishing
672 policies for any zone file in the DNS, not only those at the top
673 or second level). The documents in this category may not all be
674 IETF products and may be prepared and completed asynchronously
675 with those described above.
677 5. Permitted Characters: An Inclusion List
679 This section describes the model used to establish the algorithm and
680 character lists of [IDNA200X-Tables] and describes the names and
681 applicability of the categories used there. Note that the inclusion
682 of a character in one of the first three categories does not imply
683 that it can be used indiscriminately; some characters are associated
684 with contextual rules that must be applied as well.
686 5.1. A Tiered Model of Permitted Characters and Labels
688 Moving to an inclusion model requires a new list of characters that
689 are permitted in IDNs. In IDNA2003, the role and utility of
690 characters are independent of context and fixed forever. Making
691 those rules globally has proven impractical, partially because
692 handling of particular characters across the languages that use a
693 script, or the use of similar or identical-looking characters in
694 different scripts, are less well understood than many people believed
695 several years ago. Conversely, IDNA2003 prohibited some characters
696 entirely to avoid dealing with some of the issues discussed here --
697 restrictions that were much too severe for mnemonics based on some
698 languages.
700 Independently of the characters chosen (see next subsection), the
701 theory is to divide the characters that appear in Unicode into four
702 categories:
704 5.1.1. ALWAYS
706 Characters identified as "ALWAYS" are permitted for all uses in IDNs,
707 but may be associated with contextual restrictions (for example, any
708 character in this group that has a "right to left" property must be
709 used in context with the "Bidi" rules). The presence of a character
710 in this category implies that it has been examined and determined to
711 be appropriate for IDN use, and that it is well-understood that
712 contextual protocol restrictions in addition to those already
713 specified, such as rules about the use of given characters, are not
714 required. That, in turn, indicates that the script community
715 relevant to that character, reflecting appropriate authorities for
716 all of the known languages that use that script, has agreed that the
717 script and its components are sufficiently well understood. This
718 subsection discusses characters, rather than scripts, because it is
719 explicitly understood that a script community may decide to include
720 some characters of the script and not others.
722 Because of this condition, which requires evaluation by individual
723 script communities of the characters suitable for use in IDNs (not
724 just, e.g., the general stability of the scripts in which those
725 characters are embedded) it is not feasible to define the boundary
726 point between this category and the next one by general properties of
727 the characters, such as the Unicode property lists.
729 Despite its name, the presence of a character on this list does not
730 imply that a given registry need accept registrations containing any
731 of the characters in the category. Registries are still expected to
732 apply judgment about labels they will accept and to maintain rules
733 consistent with those judgments (see [IDNA200X-protocol] and
734 Section 5.2).
736 Characters that are placed in the "ALWAYS" category are never removed
737 from it unless the code points themselves are removed from Unicode (a
738 condition that may never occur).
740 5.1.2. MAYBE
742 Characters that are used to write the languages of the world and that
743 are thought of broadly as "letters" rather than, e.g., symbols or
744 punctuation, and that have not been placed in the "ALWAYS" or "NEVER"
745 categories (see Section 5.1.4 for the latter) belong to the "MAYBE"
746 category. As implied above, the collection of scripts and characters
747 in "MAYBE" has not yet been reviewed and finally approved by the
748 script community. It is possible that they may be appropriate for
749 general use only when special contextual rules (tests on the entire
750 label or on adjacent characters) are identified and specified.
752 In general and for maximum safety, registries SHOULD confine
753 themselves to characters from the "ALWAYS" category. However, if a
754 registry is permitting registrations only in a small number of
755 scripts -- scripts whose usage it knows well enough to develop rules
756 that are safe in its own environment -- it may be entirely appropriate
757 for it to permit registrations that use characters from the "MAYBE"
758 categories as well as the "ALWAYS" one.
760 Applications are expected to not treat "ALWAYS" and "MAYBE"
761 differently with regard to name resolution ("lookup"). They may
762 choose to provide warnings to users when labels or fully-qualified
763 names containing characters in the "MAYBE" categories are to be
764 presented to users.
766 There are actually two subcategories of MAYBE. The assignment of a
767 character to one or the other represents an estimate of whether the
768 character will eventually be treated as "ALWAYS" or "NEVER" (some
769 characters may, however, remain in the "MAYBE" categories
770 indefinitely). Since the differences between the "MAYBE"
771 subcategories do not affect the protocol, characters may be moved
772 back and forth between them as information and knowledge accumulate.
774 5.1.2.1. Subcategory MAYBE YES
776 These are letter, digit, or letter-like characters that are generally
777 presumed to be appropriate in DNS labels, for which no specific in-
778 depth script or character evaluation has been performed. The risk
779 with characters in the "MAYBE YES" category is that it may later be
780 discovered that contextual rules are required for their safe use with
781 labels that otherwise contain characters from arbitrary scripts or
782 that the characters themselves may be problematic.
784 5.1.2.2. Subcategory MAYBE NO
786 These are characters that are not letter-like, but are not excluded
787 by some other rule. Given the general ban on characters other than
788 letters and digits, it is likely that they will be moved to "NEVER"
789 when their contexts are fully understood by the relevant community.
790 However, since characters once moved to "NEVER" cannot be moved back
791 out, conservatism about making that classification is in order.
793 5.1.3. CONTEXTUAL RULE REQUIRED
795 These characters are unsafe for general use in IDNs, typically
796 because they are invisible in most scripts but affect format or
797 presentation in a few others or because they are combining characters
798 that are safe for use only in conjunction with particular characters
799 or scripts. In order to permit them to be used at all, these
800 characters are assigned to the category "CONTEXTUAL RULE REQUIRED"
801 and, when adequately understood, associated with a rule. Examples of
802 typical rules include "Must follow a character from Script XYZ", "MAY
803 occur only if the entire label is in Script ABC", "MAY occur only if
804 the previous and subsequent characters have the DEF property".
806 Because it is easier to identify these characters than to know that
807 they are actually needed in IDNs or how to establish exactly the
808 right rules for each one, a character in the CONTEXTUAL RULE REQUIRED
809 category may have a null (missing) rule set in a given version of the
810 tables. Such characters MUST NOT appear in putative labels for
811 either registration or lookup. Of course, a later version of the
812 tables might contain a non-null rule.
814 If there is a rule, it MUST be evaluated and tested on registration
815 and SHOULD be evaluated and tested on lookup. If the test fails, the
816 label should not be processed for registration or lookup in the DNS.
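The handling of null and non-null rule sets described above might be
sketched as follows. Everything here is an assumption made for
illustration: the CONTEXT_RULES table, the choice of code points, and
the rule itself are invented, and character names are used only as a
crude stand-in for a real script property lookup.

```python
import unicodedata


def follows_arabic_character(label, index):
    """Hypothetical rule: must follow a character from the Arabic script."""
    if index == 0:
        return False
    # Approximate "script" by the character name prefix (an assumption).
    return unicodedata.name(label[index - 1], "").startswith("ARABIC ")


# Hypothetical table: code point -> rule, or None for a null rule set.
CONTEXT_RULES = {
    0x200C: follows_arabic_character,  # ZERO WIDTH NON-JOINER
    0x200D: None,                      # ZERO WIDTH JOINER, no rule yet
}


def context_ok(label):
    """True if every CONTEXTUAL RULE REQUIRED character passes its rule."""
    for index, ch in enumerate(label):
        if ord(ch) not in CONTEXT_RULES:
            continue
        rule = CONTEXT_RULES[ord(ch)]
        # A null rule set means the character must not appear at all.
        if rule is None or not rule(label, index):
            return False
    return True
```

With this table, context_ok("\u0628\u200c\u0628") succeeds, while the
same label with U+200D in place of U+200C fails because its rule set
is null.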
818 5.1.4. NEVER
820 Some characters are sufficiently problematic for use in IDNs that
821 they should be excluded for both registration and lookup (i.e.,
822 conforming applications performing name resolution should verify that
823 these characters are absent; if they are present, the label strings
824 should be rejected rather than converted to A-labels and looked up).
826 Of course, this category includes code points that have been removed
827 entirely from Unicode, should such code points ever occur.
829 Characters that are placed in the "NEVER" category are never removed
830 from it or reclassified. If a character is classified as "NEVER" in
831 error and the error is sufficiently problematic, the only recourse is
832 to introduce a new code point into Unicode and classify it as "MAYBE"
833 or "ALWAYS" as appropriate.
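The stability rules stated in Section 5.1.1, Section 5.1.2, and this
subsection can be summarized in a small sketch. The Category type and
the can_reclassify function are hypothetical names. "CONTEXTUAL RULE
REQUIRED" is left out because the text does not state how characters
move into or out of that category.

```python
from enum import Enum


class Category(Enum):
    ALWAYS = "ALWAYS"
    MAYBE_YES = "MAYBE YES"
    MAYBE_NO = "MAYBE NO"
    NEVER = "NEVER"


# Once assigned, these two categories are permanent.
PERMANENT = {Category.ALWAYS, Category.NEVER}


def can_reclassify(old, new):
    """True if a character may move from category `old` to `new`."""
    if old in PERMANENT:
        return new is old
    # MAYBE characters may move between the subcategories or settle
    # into "ALWAYS" or "NEVER".
    return True
```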
835 5.2. Layered Restrictions: Tables, Context, Registration, Applications
837 The essence of the character rules in IDNAbis is that there is no
838 magic bullet for any of the issues associated with a multiscript DNS.
839 Instead, we need to have a variety of approaches that, together,
840 constitute multiple lines of defense. The actual character tables
841 are the first mechanism, protocol rules about how those characters
842 are applied or restricted in context are the second, and those two in
843 combination constitute the limits of what can be done in a protocol
844 context. Registrars are expected to restrict what they permit to be
845 registered, devising and using rules that are designed to optimize
846 the balance between confusion and risk on the one hand and maximum
847 expressiveness in mnemonics on the other.
849 5.3. A New Character List -- History
851 [[anchor29: RFC Editor: please delete this subsection.]]
853 A preliminary version of a character list that reflects the above
854 categories has been developed by the contributors to this
855 document [IDNA200X-Tables]. An earlier, initial, version was
856 developed by going through Unicode 5.0 one block and one character
857 class at a time and determining which characters, classes, and blocks
858 were clearly acceptable for IDNs, which ones were clearly unacceptable
859 (e.g., all blocks consisting entirely of compatibility characters and
860 non-language symbols were excluded as were a number of character
861 classes), and which blocks and classes were in need of further study
862 or input from the relevant language communities. That effort was
863 successful, but not at the level of producing a directly-useful
864 character table. Additional iterations on the mailing list and with
865 UTC participation largely dropped the use of Unicode blocks and
866 focused on character classes, scripts, and properties together with
867 understandings gained from other Unicode Consortium efforts. Those
868 iterations have been more successful. The iterative process has led
869 to the conclusion that the best strategy is likely to be a mixed one
870 consisting of (i) classification into "ALWAYS" and "MAYBE YES" versus
871 "MAYBE NO" and "NEVER" based on Unicode properties and a few
872 exceptions and (ii) discrimination between "ALWAYS" and "MAYBE YES"
873 and between "MAYBE NO" and "NEVER" based on script community criteria
874 about IDN appropriateness. An alternative would
875 involve an entirely new property specifically associated with
876 appropriateness for IDN use, but it is not clear that is either
877 necessary or desirable.
879 5.4. Understanding New Issues and Constraints
881 The discussion in [IDNA200X-Bidi] illustrates some areas in which
882 more work and input are needed. Other issues are raised by the
883 Unicode "presentation form" model and, in particular, by the need for
884 zero-width characters in some limited cases to correctly designate
885 those forms and by some other issues with combining characters in
886 different contexts. It is expected that, once expert and materially-
887 concerned parties are identified to supply contextual rules, such
888 problems will be resolved quickly and the questioned collections of
889 characters either added to the list of permitted characters or
890 permanently excluded.
892 5.5. ALWAYS, MAYBE, and Contextual Rules
894 As discussed above, characters will be associated with the "ALWAYS"
895 or "MAYBE YES" properties if they can plausibly be used in an IDN.
896 They are classified as "MAYBE NO" if it appears unlikely that they
897 should be used in IDNs but there is uncertainty on that point. Non-
898 language characters and other character codes that can be identified
899 as globally inappropriate for IDNs, such as conventional spaces and
900 punctuation, will be assigned to "NEVER" (i.e., will never be
901 permitted in IDNs). A character associated with "CONTEXTUAL RULE
902 REQUIRED" is acceptable in a label if it is associated with the
903 identifier of a contextual rule set and the test implied by the rule
904 set is successful. If no such identifier is present in the version
905 of the tables in use, the character is treated as roughly equivalent
906 to "NEVER", i.e., it MUST NOT be used in either registration or
907 lookup with that version of the tables. Because a rule set
908 identifier may be installed in a later table version, this status is
909 obviously not permanent. This general approach could, obviously, be
910 implemented in several ways, not just by the exact arrangements
911 suggested above.
913 The property and rule sets are used as follows:
915 o Systems supporting domain name resolution SHOULD attempt to
916 resolve any label consisting entirely of characters that are in
917 the "ALWAYS" or "MAYBE" categories, including those that have not
918 been permanently excluded but that have not been classified with
919 regard to whether additional restrictions are needed, i.e., they
920 are categorized as "MAYBE YES" or "MAYBE NO". They MUST NOT
921 attempt to resolve label strings that contain unassigned character
922 positions or those that contain "NEVER" characters.
924 o Systems providing domain name registration functions MUST NOT
925 register any label that contains characters classified as "NEVER"
926 or code point positions that are unassigned in the version of
927 Unicode they are using. If a character in a label has associated
928 contextual rules, they MUST NOT register the label unless the
929 conditions required by those rules are satisfied. They SHOULD NOT
930 register labels that contain a character assigned to a "MAYBE"
931 category.
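The two rules above might be sketched as follows. This is one
possible arrangement, not a definitive implementation: the TABLE
excerpt and its category assignments are invented for the example,
characters missing from it are treated as unassigned, and the
context_ok parameter is a hypothetical hook for contextual rule
evaluation that by default behaves like a null rule set.

```python
from enum import Enum


class Cat(Enum):
    ALWAYS = "ALWAYS"
    MAYBE_YES = "MAYBE YES"
    MAYBE_NO = "MAYBE NO"
    CONTEXT = "CONTEXTUAL RULE REQUIRED"
    NEVER = "NEVER"
    UNASSIGNED = "unassigned"


# Hypothetical table excerpt; these assignments are not from any real table.
TABLE = {
    "a": Cat.ALWAYS,
    "b": Cat.ALWAYS,
    "\u0101": Cat.MAYBE_YES,
    "\u00b7": Cat.MAYBE_NO,
    "\u200d": Cat.CONTEXT,
    " ": Cat.NEVER,
}


def categories(label):
    # Anything missing from the toy table is treated as unassigned.
    return {TABLE.get(ch, Cat.UNASSIGNED) for ch in label}


def no_rule(label):
    """Default context check: behaves like a null rule set and fails."""
    return False


def lookup_allowed(label, context_ok=no_rule):
    cats = categories(label)
    if cats & {Cat.NEVER, Cat.UNASSIGNED}:
        return False                  # MUST NOT attempt resolution
    if Cat.CONTEXT in cats and not context_ok(label):
        return False                  # contextual test failed or missing
    return True                       # "ALWAYS" and "MAYBE" treated alike


def registration_verdict(label, context_ok=no_rule):
    cats = categories(label)
    if cats & {Cat.NEVER, Cat.UNASSIGNED}:
        return "reject"               # MUST NOT register
    if Cat.CONTEXT in cats and not context_ok(label):
        return "reject"               # contextual rule not satisfied
    if cats & {Cat.MAYBE_YES, Cat.MAYBE_NO}:
        return "discouraged"          # SHOULD NOT register
    return "ok"
```

The asymmetry is visible in the return values: a label containing a
"MAYBE" character is looked up without complaint, but a registry
receives a "discouraged" verdict for the same label.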
933 A procedure for assigning rules to characters with the "MAYBE YES" or
934 "MAYBE NO" property, and for assigning (or not) the property to
935 characters assigned in future versions of Unicode, is outlined under
936 Section 11. A key part of that procedure will be specifications that
937 make it possible to add new characters and blocks without long delays
938 in implementation. The procedure will result in an update to
939 existing IANA-maintained registries.
941 6. Issues that Any Solution Must Address
943 6.1. Display and Network Order
945 The correct treatment of domain names requires a clear distinction
946 between Network Order (the order in which the code points are sent in
947 protocols) and Display Order (the order in which the code points are
948 displayed on a screen or paper). The order of labels in a domain
949 name is discussed in [IDNA200X-Bidi]. There are, however, also
950 questions about the order in which labels are displayed if left-to-
951 right and right-to-left labels are adjacent to each other, especially
952 if there are also multiple consecutive appearances of one of the
953 types. The decision about the display order is ultimately under the
954 control of user agents --including web browsers, mail clients, and
955 the like-- which may be highly localized. Even when formats are
956 specified by protocols, the full composition of an Internationalized
957 Resource Identifier (IRI) [RFC3987] or Internationalized Email
958 address contains elements other than the domain name. For example,
959 IRIs contain protocol identifiers and field delimiter syntax such as
960 "http://" or "mailto:" while email addresses contain the "@" to
961 separate local parts from domain names. User agents are not required
962 to use those protocol-based forms directly but often do so. While
963 display, parsing, and processing within a label is specified by the
964 IDNA protocol and the associated documents, the relationship between
965 fully-qualified domain names and internationalized labels is
966 unchanged from the base DNS specifications. Comments here about such
967 full domain names are explanatory or examples of what might be done
968 and must not be considered normative.
970 Questions remain about protocol constraints implying that the overall
971 direction of these strings will always be left-to-right (or right-to-
972 left) for an IRI or email address, or if they even should conform to
973 such rules. These questions also have several possible answers.
975 Should a domain name abc.def, in which both labels are represented in
976 scripts that are written right-to-left, be displayed as fed.cba or
977 cba.fed? An IRI for clear text web access would, in network order,
978 begin with "http://" and the characters will appear as
979 "http://abc.def" -- but what does this suggest about the display
980 order? When entering a URI to many browsers, it may be possible to
981 provide only the domain name and leave the "http://" to be filled in
982 by default, assuming no tail (an approach that does not work for
983 other protocols). The natural display order for the typed domain
984 name on a right-to-left system is fed.cba. Does this change if a
985 protocol identifier, tail, and the corresponding delimiters are
986 specified?
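The two candidate display orders for the example can be produced
mechanically, as in the sketch below. The ASCII letters stand in for
right-to-left characters, as they do in the text, and the function
names are hypothetical. This is not an implementation of the Unicode
bidirectional algorithm; it only shows that the candidates differ in
whether the order of the labels themselves is reversed.

```python
def reverse_whole_name(name):
    """Reverse every character, so the labels also change places."""
    return name[::-1]


def reverse_within_labels(name):
    """Reverse the characters of each label but keep label order."""
    return ".".join(label[::-1] for label in name.split("."))


network_order = "abc.def"
print(reverse_whole_name(network_order))     # fed.cba
print(reverse_within_labels(network_order))  # cba.fed
```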
988 While logic, precedent, and reality suggest that these are questions
989 for user interface design, not IETF protocol specifications,
990 experience in the 1980s and 1990s with mixing systems in which domain
991 name labels were read in network order (left-to-right) and those in
992 which those labels were read right-to-left would predict a great deal
993 of confusion, and heuristics that sometimes fail, if each
994 implementation of each application makes its own decisions on these
995 issues.
997 It should be obvious that any revision of IDNA must be more clear
998 about the distinction between network and display order for complete
999 (fully-qualified) domain names, as well as simply for individual
1000 labels, than the original specification was. It is likely that some
1001 strong suggestions should be made about display order as well.
1003 6.2. Entry and Display in Applications
1005 Applications can accept domain names using any character set or sets
1006 desired by the application developer, and can display domain names in
1007 any charset. That is, the IDNA protocol does not affect the
1008 interface between users and applications.
1010 An IDNA-aware application can accept and display internationalized
1011 domain names in two formats: the internationalized character set(s)
1012 supported by the application (i.e., an appropriate local
1013 representation of a U-label), and as an A-label. Applications MAY
1014 allow the display and user input of A-labels, but are not encouraged
1015 to do so except as an interface for special purposes, possibly for
1016 debugging, or to cope with display limitations. A-labels are opaque
1017 and ugly, and, where possible, should thus only be exposed to users
1018 who absolutely need them. Because IDN labels can be rendered either
1019 as the A-labels or U-labels, the application may reasonably have an
1020 option for the user to select the preferred method of display; if it
1021 does, rendering the U-label should normally be the default.
1023 Domain names are often stored and transported in many places. For
1024 example, they are part of documents such as mail messages and web
1025 pages. They are transported in many parts of many protocols, such as
1026 both the control commands and the RFC 2822 body parts of SMTP, and
1027 the headers and the body content in HTTP. It is important to
1028 remember that domain names appear both in domain name slots and in
1029 the content that is passed over protocols.
1031 In protocols and document formats that define how to handle
1032 specification or negotiation of charsets, labels can be encoded in
1033 any charset allowed by the protocol or document format. If a
1034 protocol or document format only allows one charset, the labels MUST
1035 be given in that charset. Of course, not all charsets can properly
1036 represent all labels. If a U-label cannot be displayed in its
1037 entirety, the only choice (without loss of information) may be to
1038 display the A-label.
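One way an application might choose between the two forms is
sketched below, using the "idna" codec from the Python standard
library (which implements IDNA2003). The function name display_form
and the idea of probing the target charset by trial encoding are
assumptions made for the example.

```python
def display_form(a_label, charset):
    """Return the U-label if `charset` can represent it, else the A-label."""
    u_label = a_label.encode("ascii").decode("idna")
    try:
        u_label.encode(charset)
    except UnicodeEncodeError:
        # The charset cannot carry every character, so fall back to
        # the A-label rather than lose information.
        return a_label
    return u_label


print(display_form("xn--bcher-kva", "latin-1"))
print(display_form("xn--bcher-kva", "ascii"))
```

The first call can show the U-label because Latin-1 contains U+00FC;
the second falls back to the A-label.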
1040 In any place where a protocol or document format allows transmission
1041 of the characters in internationalized labels, labels SHOULD be
1042 transmitted using whatever character encoding and escape mechanism
1043 the protocol or document format uses at that place.
1045 All protocols that use domain name slots already have the capacity
1046 for handling domain names in the ASCII charset. Thus, A-labels can
1047 inherently be handled by those protocols.
1049 6.3. The Ligature and Digraph Problem
1051 There are a number of languages written with alphabetic scripts in
1052 which single phonemes are written using two characters, termed a
1053 "digraph", for example, the "ph" in "pharmacy" and "telephone".
1054 (Note that characters paired in this manner can also appear
1055 consecutively without forming a digraph, as in "tophat".) Certain
1056 digraphs are normally indicated typographically by setting the two
1057 characters closer together than they would be if used consecutively
1058 to represent different phonemes. Some digraphs are fully joined as
1059 ligatures (strictly designating setting totally without intervening
1060 white space, although the term is sometimes applied to close set
1061 pairs). An example of this may be seen when the word "encyclopaedia"
1062 is set with U+00E6 LATIN SMALL LETTER AE (and some would not
1063 consider that word correctly spelled unless the ligature form was
1064 used or the "a" was dropped entirely).
1066 Difficulties arise from the fact that a given ligature may be a
1067 completely optional typographic convenience for representing a
1068 digraph in one language (as in the above example with some spelling
1069 conventions), while in another language it is a single character that
1070 may not always be correctly representable by a two-letter sequence
1071 (as in the above example with different spelling conventions). This
1072 can be illustrated by many words in the Norwegian language, where the
1073 "ae" ligature is the 27th letter of a 29-letter extended Latin
1074 alphabet. It is equivalent to the 28th letter of the Swedish
1075 alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A
1076 WITH DIAERESIS, for which an "ae" cannot be substituted according to
1077 current orthographic standards.
1079 That character (U+00E4) is also part of the German alphabet where,
1080 unlike in the Nordic languages, the two-character sequence "ae" is
1081 usually treated as a fully acceptable alternate orthography. The
1082 inverse is however not true, and those two characters cannot
1083 necessarily be combined into an "umlauted a". This also applies to
1084 another German character, the "umlauted o" (U+00F6 LATIN SMALL LETTER
1085 O WITH DIAERESIS) which, for example, cannot be used for writing the
1086 name of the author "Goethe". It is also a letter in the Swedish
1087 alphabet where, in parallel to the "umlauted a", it cannot be
1088 correctly represented as "oe" and in the Norwegian alphabet, where it
1089 is represented, not as "umlauted o", but as "slashed o", U+00F8.
1091 Additional cases with alphabets written right-to-left are described
1092 in Section 6.4. This constitutes a problem that cannot be resolved
1093 solely by operating on scripts. It is, however, a key concern in the
1094 IDN context. Its satisfactory resolution will require support in
1095 policies set by registries, which therefore need to be particularly
1096 mindful not just of this specific issue, but of all other related
1097 matters that cannot be dealt with on an exclusively algorithmic
1098 basis.
1100 Just as with the examples of different-looking characters that may be
1101 assumed to be the same, it is in general impossible to deal with
1102 these situations in a system such as IDNA -- or with Unicode
1103 normalization generally -- since determining what to do requires
1104 information about the language being used, context, or both.
1105 Consequently, these specifications make no attempt to treat these
1106 combined characters in any special way. However, their existence
1107 provides a prime example of a situation in which a registry that is
1108 aware of the language context in which labels are to be registered,
1109 and where that language sometimes (or always) treats the two-
1110 character sequences as equivalent to the combined form, should give
1111 serious consideration to applying a "variant" model [RFC3743]
1112 [RFC4290] to reduce the opportunities for user confusion and fraud
1113 that would result from the related strings being registered to
1114 different parties.
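The point that Unicode normalization cannot settle these cases can be checked directly. The following is a minimal sketch using Python's standard unicodedata module; the helper name nfkc is an illustrative assumption and is not taken from any IDNA specification. It contrasts a purely typographic ligature, which NFKC does decompose, with the letters discussed above, which NFKC leaves untouched.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Apply Unicode Normalization Form KC to a string."""
    return unicodedata.normalize("NFKC", s)


# U+FB01 LATIN SMALL LIGATURE FI is a presentation form only:
# NFKC replaces it with the two letters "fi".
fi_result = nfkc("\ufb01")

# U+00E6 and U+00E4 are letters in their own right: NFKC leaves them
# unchanged and never produces the two-letter sequence "ae".
ae_result = nfkc("\u00e6")
a_umlaut_result = nfkc("\u00e4")

# The decomposed spelling (a + COMBINING DIAERESIS) is composed to U+00E4.
composed_result = nfkc("a\u0308")
```

Whether "ae" and U+00E4 should be treated as the same name therefore has to be decided by registry policy, for example with a variant table, rather than by normalization.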
1116 6.4. Right-to-left Text
1118 In order to be sure that the directionality of right-to-left text is
1119 unambiguous, IDNA2003 required that any label in which right-to-left
1120 characters appear both starts and ends with them, may not include any
1121 characters with strong left-to-right properties (which excludes other
1122 alphabetic characters but permits European digits), and rejects any
1123 other string that contains a right-to-left character. This is one of
1124 the few places where the IDNA algorithms (both old and new) are
1125 required to look at an entire label, not just at individual
1126 characters. Unfortunately, the algorithmic model used in IDNA2003
1127 fails when the final character in a right-to-left string requires a
1128 combining mark in order to be correctly represented. The mark will
1129 be the final code point in the string but is not identified with the
1130 right-to-left character attribute and Stringprep therefore rejects
1131 the string.
1133 This problem manifests itself in languages written with consonantal
1134 alphabets to which diacritical vocalic systems are applied, and in
1135 languages with orthographies derived from them where the combining
1136 marks may have different functionality. In both cases the combining
1137 marks can be essential components of the orthography. Examples of
1138 this are Yiddish, written with an extended Hebrew script, and Dhivehi
1139 (the official language of Maldives) which is written in the Thaana
1140 script (which is, in turn, derived from the Arabic script). Other
1141 languages are still being investigated, but the new rules for right
1142 to left scripts are described in [IDNA200X-Bidi].
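The IDNA2003 rule described above, and the way it fails for labels ending in a combining mark, can be sketched as follows. This is an approximation for illustration only: it relies on the RFC 3454 tables D.1 (RandALCat) and D.2 (LCat) as exposed by Python's standard stringprep module, and the function name idna2003_bidi_ok is an assumption, not part of any specification.

```python
import stringprep  # RFC 3454 tables, based on Unicode 3.2


def idna2003_bidi_ok(label: str) -> bool:
    """Approximate the Stringprep bidi check that IDNA2003 applied."""
    has_rtl = any(stringprep.in_table_d1(ch) for ch in label)
    if not has_rtl:
        return True  # the rule only constrains labels with RTL characters
    if any(stringprep.in_table_d2(ch) for ch in label):
        return False  # strong left-to-right characters are not allowed
    # An RTL character must be both the first and the last code point.
    return stringprep.in_table_d1(label[0]) and stringprep.in_table_d1(label[-1])


# Hebrew ALEF + BET: starts and ends with right-to-left letters.
plain_hebrew = idna2003_bidi_ok("\u05d0\u05d1")

# ALEF + PATAH: the final code point is a combining mark (bidi class NSM),
# so the whole label is rejected.
pointed_hebrew = idna2003_bidi_ok("\u05d0\u05b7")

# A Thaana string ending in a vowel sign is rejected for the same reason.
thaana = idna2003_bidi_ok("\u078b\u07a8\u0788\u07ac\u0780\u07a8")
```

In this sketch the last two labels are rejected solely because their final code point is a nonspacing mark, which is the failure described in the text.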
1144 7. IDNs and the Robustness Principle
1146 The model of IDNs described in this document can be seen as a
1147 particular instance of the "Robustness Principle" that has been so
1148 important to other aspects of Internet protocol design. This
1149 principle is often stated as "Be conservative about what you send and
1150 liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2
1151 [RFC1123]). For IDNs to work well, registries must have or require
1152 sensible policies about what is registered -- conservative policies
1153 -- and implement and enforce them. Registries, registrars, or other
1154 actors who do not do so, or who get too liberal, too greedy, or too
1155 weird may deserve punishment that will primarily be meted out in the
1156 marketplace or by consumer protection rules and legislation. One can
1157 debate whether or not "punishment by browser vendor" is an effective
1158 marketplace tool, but it falls into the general category of
1159 approaches being discussed here. In any event, the Protocol Police
1160 (an important, although mythical, Internet mechanism for enforcing
1161 protocol conformance) are going to be worth about as much here as
1162 they usually are -- i.e., very little -- simply because, unlike the
1163 marketplace and legal and regulatory mechanisms, they have no
1164 enforcement power.
1166 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels
1167 that clearly violate global (protocol) rules (no one has ever
1168 seriously claimed that being liberal in what is accepted requires
1169 being stupid). However, once one gets past such global rules and
1170 deals with anything sensitive to script or locale, it is necessary to
1171 assume that garbage has not been placed into the DNS, i.e., one must
1172 be liberal about what one is willing to look up in the DNS rather
1173 than guessing about whether it should have been permitted to be
1174 registered.
1176 As mentioned above, if a string doesn't resolve, it makes no
1177 difference whether it simply wasn't registered or was prohibited by
1178 some rule.
1180 If resolvers, as a user interface (UI) matter, decide to warn about
1181 some strings that are valid under the global rules but that they
1182 perceive as dangerous, that is their prerogative and we can only hope
1183 that the market (and maybe regulators) will reward the good choices
1184 and punish the bad ones. In this context, a resolver that decides a
1185 string that is valid under the protocol is dangerous and refuses to
1186 look it up is in violation of the protocols (if they are properly
1187 defined); one that is willing to look something up, but warns against
1188 it, is exercising a UI choice.
1190 8. Migration and Version Synchronization
1192 8.1. Design Criteria
1194 As mentioned above and in RFC 4690, two key goals of this work are to
1195 enable applications to be agnostic about whether they are being run
1196 in environments supporting any Unicode version from 3.2 onward and to
1197 permit incrementally adding permitted scripts and other character
1198 collections without disruption. The mechanisms that support this are
1199 outlined above, but this section reviews them in a context that may
1200 be more helpful to those who need to understand the approach and make
1201 plans for it.
1203 1. The general criteria for a putative label, and the collection of
1204 characters that make it up, to be considered IDNA-valid are:
1206 * The characters are "letters", numerals, or otherwise used to
1207 write words in some language. Symbols, drawing characters,
1208 and various notational characters are permanently excluded --
1209 some because they are actively dangerous in URI, IRI, or
1210 similar contexts and others because there is no evidence that
1211 they are important enough to Internet operations or
1212 internationalization to justify large numbers of special cases
1213 and character-specific handling (additional discussion and
1214 rationale for the symbol decision appears in Section 8.5). If
1215 strings are read out loud, rather than seen on paper, there
1216 are opportunities for considerable confusion between the name
1217 of a symbol (and a single symbol may have multiple names) and
1218 the symbol itself. Other than in very exceptional cases,
1219 e.g., where they are needed to write substantially any word of
1220 a given language, punctuation characters are excluded as well.
1221 The fact that a word exists is not proof that it should be
1222 usable in a DNS label and DNS labels are not expected to be
1223 usable for multiple-word phrases (although they are not
1224 prohibited if the conventions and orthography of a particular
1225 language cause that to be possible).
1227 * Characters that are unassigned in the version of Unicode being
1228 used by the registry or application are not permitted, even on
1229 resolution (lookup). This is because, unlike the conditions
1230 contemplated in IDNA2003 (except for right-to-left text), we
1231 now understand that tests involving the context of characters
1232 (e.g., some characters being permitted only adjacent to other
1233 ones of specific types) and integrity tests on complete labels
1234 will be needed. Unassigned code points cannot be permitted
1235 because one cannot determine the contextual rules that
1236 particular code points will require before characters are
1237 assigned to them and the properties of those characters fully
1238 understood.
1240 * Any character that is mapped to another character by
1241 Nameprep2003 or by a current version of NFKC is prohibited as
1242 input to IDNA (for either registration or resolution).
1243 Implementers of user interfaces to applications are free to
1244 make those conversions when they consider them suitable for
1245 their operating system environments, context, or users.
1247 Tables used to identify the characters that are IDNA-valid are
1248 expected to be driven by the principles above. The principles
1249 are not just an interpretation of the tables.
1251 2. For registration purposes, the collection of IDNA-valid
1252 characters will be a growing list. The conditions for entry to
1253 the list for a set of characters are (i) that they meet the
1254 conditions for IDNA-valid characters discussed immediately above
1255 and (ii) that consensus can be reached about usage and contextual
1256 rules. Because it is likely that such consensus cannot be
1257 reached immediately about the correct contextual rules for some
1258 characters -- e.g., the use of invisible ("zero-width")
1259 characters to modify presentation forms -- some sets of
1260 characters may be deferred from the IDNA-valid set even if they
1261 appear in a current version of Unicode. Of course, characters
1262 first assigned code points in later versions of Unicode would
1263 need to be introduced into IDNA only after those code points are
1264 assigned.
1266 3. Anyone entering a label into a DNS zone must properly validate
1267 that label -- i.e., be sure that the criteria for an A-label are
1268 met -- in order for Unicode version-independence to be possible.
1269 In particular:
1271 * Any label that contains hyphens as its third and fourth
1272 characters MUST be IDNA-valid. This implies that, (i) if the
1273 third and fourth characters are hyphens, the first and second
1274 ones MUST be "xn" until and unless this specification is
1275 updated to permit other prefixes and (ii) labels starting in
1276 "xn--" MUST be valid A-labels, as discussed in Section 3
1277 above.
1279 * The Unicode tables (i.e., tables of code points, character
1280 classes, and properties) and IDNA tables (i.e., tables of
1281 contextual rules such as those described above), MUST be
1282 consistent on the systems performing or validating labels to
1283 be registered. Note that this does not require that tables
1284 reflect the latest version of Unicode, only that all tables
1285 used on a given system are consistent with each other.
1287 Systems looking up or resolving DNS labels MUST be able to assume
1288 that those rules were followed.
1290 4. Anyone looking up a label in a DNS zone MUST
1292 * Maintain a consistent set of tables, as discussed above. As
1293 with registration, the tables need not reflect the latest
1294 version of Unicode but they MUST be consistent.
1296 * Validate labels to be looked up only to the extent of
1297 determining that the U-label does not contain either code
1298 points prohibited by IDNA (categorized as "NEVER") or code
1299 points that are unassigned in its version of Unicode. No
1300 attempt should be made to validate contextual rules about
1301 characters, including mixed-script label prohibitions,
1302 although such rules MAY be used to influence presentation
1303 decisions in the user interface.
1305 By avoiding applying its own interpretation of which labels are
1306 valid as a means of rejecting lookup attempts, the resolver
1307 application becomes less sensitive to version incompatibilities
1308 with the particular zone registry associated with the domain
1309 name.
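The hyphen rule in item 3 can be sketched as a small registration-side check. The sketch below is an illustration under stated assumptions: the function name check_hyphen_rule is hypothetical, and the Punycode round trip (using the "punycode" codec in Python's standard library) stands in for full A-label validation, which would also require the character and contextual tests described elsewhere.

```python
def check_hyphen_rule(label: str) -> bool:
    """Return False if a label with "--" in positions 3-4 is not a
    plausible A-label; labels without that pattern pass unchanged."""
    if label[2:4] != "--":
        return True  # the rule does not apply to this label
    if label[:4].lower() != "xn--":
        return False  # no prefix other than "xn" is currently permitted
    payload = label[4:]
    try:
        u_label = payload.encode("ascii").decode("punycode")
    except UnicodeError:
        return False  # not decodable as Punycode
    if u_label.isascii():
        return False  # a U-label must contain a non-ASCII character
    # Converting back must reproduce the payload without loss.
    round_trip = u_label.encode("punycode").decode("ascii")
    return round_trip.lower() == payload.lower()
```

For example, check_hyphen_rule("xn--bcher-kva") succeeds, while "ab--cd" fails because its first two characters are not "xn".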
1311 Under this model, a registry (or entity communicating with a registry
1312 to accomplish name registrations) will need to update its tables --
1313 both the Unicode-associated tables and the tables of permitted IDN
1314 characters -- to enable a new script or other set of new characters.
1315 It will not be affected by newer versions of Unicode, or newly-
1316 authorized characters, until and unless it wishes to make those
1317 registrations. The registration side is also responsible -- under the
1318 protocol and to registrants and users -- for much more careful
1319 checking than is expected of application systems that look names up,
1320 both checking as required by the protocol and checking required by
1321 whatever policies it develops for minimizing risks due to confusable
1322 characters and sequences and preserving language or script integrity.
1324 An application or client that looks names up in the DNS will be able
1325 to resolve any name that is registered, as long as its version of the
1326 Unicode-associated tables is sufficiently up-to-date to interpret all
1327 of the characters in the label. It SHOULD distinguish, in its
1328 messages to users, between "label contains an unallocated code point"
1329 and other types of lookup failures. A failure on the basis of an old
1330 version of Unicode may lead the user to a desire to upgrade to a
1331 newer version, but will have no other ill effects (this is consistent
1332 with behavior in the transition to the DNS when some hosts could not
1333 yet handle some forms of names or record types).
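The limited lookup-side validation described in this section might look like the following sketch. It is an assumption-laden illustration: NEVER_EXAMPLES is a two-character stand-in for the real "NEVER" category (which is defined by the tables document, not here), the function name lookup_precheck and its return strings are hypothetical, and "unassigned" is judged against whatever Unicode version the local Python installation ships.

```python
import unicodedata

# Hypothetical stand-in for the "NEVER" category: two symbols chosen only
# for illustration (BLACK HEART SUIT and COPYRIGHT SIGN).
NEVER_EXAMPLES = {"\u2665", "\u00a9"}


def lookup_precheck(u_label: str) -> str:
    """Classify a U-label before lookup, keeping "unassigned code point"
    distinct from other failures so the user can be told which occurred."""
    for ch in u_label:
        if unicodedata.category(ch) == "Cn":
            return "unassigned-code-point"
        if ch in NEVER_EXAMPLES:
            return "prohibited-code-point"
    # No contextual or mixed-script tests are applied at lookup time.
    return "ok"
```

Note that a mixed-script label such as "p\u0430ypal" (with a Cyrillic letter) returns "ok" here: under this model such a label may influence presentation in the user interface but is still looked up.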
1335 8.2. More Flexibility in User Agents
1337 One key philosophical difference between IDNA2003 and this proposal
1338 is that the former provided mappings for many characters into others.
1339 These mappings were not reversible: the original string could not be
1340 recovered from the form stored in the DNS and, probably as a
1341 consequence, users became confused about what characters were valid
1342 for IDNs and which ones were not. Too many times, the answer to the
1343 question "can this character be used in an IDN" was "it depends on
1344 exactly what you mean by 'used'".
1346 IDNA200x does not perform these mappings but, instead, prohibits the
1347 characters that would be mapped to others. As examples, while
1348 mathematical characters based on Latin ones are accepted as input to
1349 IDNA2003, they are prohibited in IDNA200x. Similarly, double-width
1350 characters and other variations are prohibited as IDNA input.
1352 Since the rules in [IDNA200X-Tables] provide that only strings that
1353 are stable under NFKC are valid, if it is convenient for an
1354 application to perform NFKC normalization before lookup, that
1355 operation is safe since this will never make the application unable
1356 to look up any valid string.
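The stability condition can be expressed in a few lines. This is a minimal sketch using Python's standard unicodedata module; the function name stable_under_nfkc is an illustrative assumption.

```python
import unicodedata


def stable_under_nfkc(s: str) -> bool:
    """True if NFKC normalization leaves the string unchanged."""
    return unicodedata.normalize("NFKC", s) == s


# An ordinary label with a precomposed letter is already stable.
plain = stable_under_nfkc("b\u00fccher")

# A fullwidth "a" (U+FF41) and a mathematical bold "a" (U+1D41A) are both
# changed by NFKC, so strings containing them are not stable.
fullwidth = stable_under_nfkc("\uff41bc")
math_bold = stable_under_nfkc("\U0001d41a")
```

Because NFKC is idempotent, a string that has been normalized once is stable by construction, which is why normalizing before lookup cannot turn a valid string into an invalid one.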
1358 In many cases these prohibitions should have no effect on what the
1359 user can type at resolution time: it is perfectly reasonable for
1360 systems that support user interfaces at lookup time, to perform some
1361 character mapping that is appropriate to the local environment prior
1362 to actual invocation of IDNA as part of the Unicode conversions of
1363 [IDNA200X-protocol] above. However, those changes will be local ones
1364 only -- local to environments in which users will clearly understand
1365 that the character forms are equivalent. For use in interchange
1366 among systems, it appears to be much more important that U-labels and
1367 A-labels can be mapped back and forth without loss of information.
1369 One specific, and very important, instance of this change in strategy
1370 arises with case-folding. In the ASCII-only DNS, names are looked up
1371 and matched in a case-independent way, but no actual case-folding
1372 occurs. Names can be placed in the DNS in either upper or lower case
1373 form (or any mixture of them) and that form is preserved, returned in
1374 queries, and so on. IDNA2003 attempted to simulate that behavior by
1375 performing case-mapping at registration time (resulting in only
1376 lower-case IDNs in the DNS) and when names were looked up.
1378 As suggested earlier in this section, it appears to be desirable to
1379 do as little character mapping as possible consistent with having
1380 Unicode work correctly (e.g., NFC mapping to resolve different
1381 codings for the same character is still necessary) and to make the
1382 mapping between A-labels and U-labels idempotent. Case-mapping is
1383 not an exception to this principle. If only lower case characters
1384 can be registered in the DNS (i.e., present in a U-label), then
1385 IDNA200x should prohibit upper-case characters as input. Some other
1386 considerations reinforce this conclusion. For example, an essential
1387 element of the ASCII case-mapping functions is that
1388 uppercase(character) must be equal to
1389 uppercase(lowercase(character)). That requirement may not be
1390 satisfied with IDNs. The relationship between upper case and lower
1391 case may even be language-dependent, with different languages (or
1392 even the same language in different areas) using different mappings.
1393 Of course, the expectations of users who are accustomed to a case-
1394 insensitive DNS environment will probably be well-served if user
1395 agents perform case mapping prior to IDNA processing, but the IDNA
1396 procedures themselves should neither require such mapping nor expect
1397 it when it isn't natural to the localized environment.
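The failure of the ASCII case-mapping identity outside ASCII can be demonstrated with Python's built-in, locale-independent str.upper() and str.lower(). The helper name round_trip_holds is an assumption made for this sketch.

```python
import string


def round_trip_holds(ch: str) -> bool:
    """Check uppercase(ch) == uppercase(lowercase(ch)) for one character."""
    return ch.upper() == ch.lower().upper()


# The identity holds for every ASCII letter.
ascii_ok = all(round_trip_holds(c) for c in string.ascii_letters)

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE: lowercasing yields "i"
# followed by U+0307, and uppercasing that yields "I" + U+0307, which is
# a different string from U+0130.
dotted_i_ok = round_trip_holds("\u0130")

# U+00DF (Eszett) shows the inverse problem: upper-casing and then
# lower-casing does not return the original character.
eszett_back = "\u00df".upper().lower()
```

These built-in mappings are not language-aware; a Turkish-specific mapping would treat "i" and "I" differently again, which is the language dependence mentioned above.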
1399 8.3. The Question of Prefix Changes
1401 The conditions that would require a change in the IDNA "prefix"
1402 ("xn--" for the version of IDNA specified in [RFC3490]) have been a
1403 great concern to the community. A prefix change would clearly be
1404 necessary if the algorithms were modified in a manner that would
1405 create serious ambiguities during subsequent transition in
1406 registrations. This section summarizes our conclusions about the
1407 conditions under which changes in prefix would be necessary.
1409 8.3.1. Conditions requiring a prefix change
1411 An IDN prefix change is needed if a given string would resolve or
1412 otherwise be interpreted differently depending on the version of the
1413 protocol or tables being used. Consequently, work to update IDNs
1414 would require a prefix change if, and only if, one of the following
1415 four conditions were met:
1417 1. The conversion of an A-label to Unicode (i.e., a U-label) yields
1418 one string under IDNA2003 (RFC3490) and a different string under
1419 IDNA200x.
1421 2. An input string that is valid under IDNA2003 and also valid under
1422 IDNA200x yields two different A-labels with the different
1423 versions of IDNA. This condition is believed to be essentially
1424 equivalent to the one above.
1426 Note, however, that if the input string is valid under one
1427 version and not valid under the other, this condition does not
1428 apply. See the first item in Section 8.3.2, below.
1430 3. A fundamental change is made to the semantics of the string that
1431 is inserted in the DNS, e.g., if a decision were made to try to
1432 include language or specific script information in that string,
1433 rather than having it be just a string of characters.
1435 4. A sufficiently large number of characters is added to Unicode so
1436 that the Punycode mechanism for block offsets no longer has
1437 enough capacity to reference the higher-numbered planes and
1438 blocks. This condition is unlikely even in the long term and
1439 certain not to arise in the next few years.
1441 8.3.2. Conditions not requiring a prefix change
1443 In particular, as a result of the principles described above, none of
1444 the following changes require a new prefix:
1446 1. Prohibition of some characters as input to IDNA. This may make
1447 names that are now registered inaccessible, but does not require
1448 a prefix change.
1450 2. Adjustments in Stringprep tables or IDNA actions, including
1451 normalization definitions, that do not affect characters that
1452 have already been invalid under IDNA2003.
1454 3. Changes in the style of definitions of Stringprep or Nameprep
1455 that do not alter the actions performed by them.
1457 8.4. Stringprep Changes and Compatibility
1459 Concerns have been expressed about problems for non-DNS uses of
1460 Stringprep being caused by changes to the specification intended to
1461 improve the handling of IDNs, most notably as this might affect
1462 identification and authentication protocols. Section 8.3, above,
1463 essentially also applies in this context. The proposed new inclusion
1464 tables [IDNA200X-Tables], the reduction in the number of characters
1465 permitted as input for registration or resolution (Section 5), and
1466 even the proposed changes in handling of right-to-left strings
1467 [IDNA200X-Bidi] either give interpretations to strings prohibited
1468 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings
1469 that are valid under both IDNA2003 and IDNA200x, and the
1470 corresponding versions of Stringprep, are not changed in
1471 interpretation. This protocol does not use either Nameprep or
1472 Stringprep as specified in IDNA2003.
1474 It is particularly important to keep IDNA processing separate from
1475 processing for various security protocols because some of the
1476 constraints that are necessary for smooth and comprehensible use of
1477 IDNs may be unwanted or undesirable in other contexts. For example,
1478 the criteria for good passwords or passphrases are very different
1479 from those for desirable IDNs. Similarly, internationalized SCSI
1480 identifiers and other protocol components are likely to have
1481 different requirements than IDNs.
1483 Perhaps even more important in practice, since most other known uses
1484 of Stringprep encode or process characters that are already in
1485 normalized form and expect the use of only those characters that can
1486 be used in writing words of languages, the changes proposed here and
1487 in [IDNA200X-Tables] are unlikely to have any effect at all,
1488 especially not on registries and registrations that follow rules
1489 already in existence when this work started.
1491 8.5. The Symbol Question
1493 [[anchor37: Move this material and integrate with the Symbol
1494 discussion above???]]
1496 One of the major differences between this specification and the
1497 original version of IDNA is that the original version permitted non-
1498 letter symbols of various sorts in the protocol. They were always
1499 discouraged in practice. In particular, both the "IESG Statement"
1500 about IDNA and all versions of the ICANN Guidelines specify that only
1501 language characters be used in labels. This specification bans the
1502 symbols entirely. There are several reasons for this, which include:
1504 o As discussed elsewhere, the original IDNA specification assumed
1505 that as many Unicode characters as possible should be permitted,
1506 directly or via mapping to other characters, in IDNs. This
1507 specification operates on an inclusion model, extrapolating from
1508 the LDH rules -- which have served the Internet very well -- to a
1509 Unicode base rather than an ASCII base.
1511 o Unicode names for letters are fairly intuitive, recognizable to
1512 users of the relevant script, and unambiguous. Symbol names are
1513 more problematic because there may be no general agreement on
1514 whether a particular glyph matches a symbol, there are no uniform
1515 conventions for naming, variations such as outline, solid, and
1516 shaded forms may or may not exist, and so on. As a result,
1517 symbols are a very poor basis for reliable communications. Of
1518 course, these difficulties with symbols do not arise with actual
1519 pictographic languages and scripts which would be treated like any
1520 other language characters; the two should not be confused.
1522 8.6. Other Compatibility Issues
1524 The existing (2003) IDNA model has several odd artifacts which occur
1525 largely by accident. Many, if not all, of these are potential
1526 avenues for exploits, especially if the registration process permits
1527 "source" names (names that have not been processed through IDNA and
1528 nameprep) to be registered. As one example, since the character
1529 Eszett, used in German, is mapped by IDNA2003 into the sequence "ss"
1530 rather than being retained as itself or prohibited, a string
1531 containing that character but otherwise in ASCII is not really an IDN
1532 (in the U-label sense defined above) at all. After Nameprep maps the
1533 Eszett out, the result is an ASCII string and so does not get an xn--
1534 prefix, but the string that can be displayed to a user appears to be
1535 an IDN. The proposed IDNA200x eliminates this artifact. A character
1536 is either permitted as itself or it is prohibited; special cases that
1537 make sense only in a particular linguistic or cultural context can be
1538 dealt with as localization matters where appropriate.
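The Eszett artifact can be observed with the "idna" codec in Python's standard library, which implements the IDNA2003 ToASCII operation of RFC 3490. The variable names below are illustrative only.

```python
# IDNA2003 (as implemented by Python's built-in "idna" codec) maps
# U+00DF to "ss", so the result is a plain ASCII label with no prefix.
eszett_label = "stra\u00dfe".encode("idna")

# By contrast, a label containing U+00FC is retained and Punycode-encoded,
# so it receives the "xn--" prefix.
umlaut_label = "b\u00fccher".encode("idna")

eszett_has_prefix = eszett_label.startswith(b"xn--")
umlaut_has_prefix = umlaut_label.startswith(b"xn--")
```

The first label is stored and looked up as an ordinary ASCII name even though what the user typed looks like an IDN, which is the artifact described above.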
1540 9. Acknowledgments
1542 The editor and contributors would like to express their thanks to
1543 those who contributed significant early review comments, sometimes
1544 accompanied by text, especially Mark Davis, Paul Hoffman, Simon
1545 Josefsson, and Sam Weiler. In addition, some specific ideas were
1546 incorporated from suggestions, text, or comments about sections that
1547 were unclear supplied by Frank Ellerman, Michael Everson, Asmus
1548 Freytag, Michel Suignard, and Ken Whistler, although, as usual, they
1549 bear little or no responsibility for the conclusions the editor and
1550 contributors reached after receiving their suggestions. Thanks are
1551 also due to Vint Cerf, Debbie Garside, and Jefsey Morphin for
1552 conversations that led to considerable improvements in the content of
1553 this document.
1555 10. Contributors
1557 While the listed editor held the pen, this document represents the
1558 joint work and conclusions of an ad hoc design team consisting of the
1559 editor and, in alphabetic order, Harald Alvestrand, Tina Dam, Patrik
1560 Faltstrom, and Cary Karp. In addition, there were many specific
1561 contributions and helpful comments from those listed in the
1562 Acknowledgments section and others who have contributed to the
1563 development and use of the IDNA protocols.
1565 11. IANA Considerations
1567 11.1. IDNA Permitted Character Registry
1569 The distinction between "MAYBE" code points and those classified into
1570 "ALWAYS" and "NEVER" (see Section 5) requires a registry of
1571 characters and scripts and their categories. IANA is requested to
1572 establish that registry, using the "expert reviewer" model. Unlike
1573 usual practice, we recommend that the "expert reviewer" be a
1574 committee that reflects expertise on the relevant scripts, and
1575 encourage IANA, the IESG, and IAB to establish liaisons and work
1576 together with other relevant standards bodies to populate that
1577 committee and its procedures over the long term.
1579 11.2. IDNA Context Registry
1581 For characters that are defined in the permitted character registry as
1582 requiring a contextual rule, IANA will create and maintain a list of
1583 approved contextual rules, using the registration methods described
1584 above. IANA should develop a format for that registry, or a copy of
1585 it maintained in parallel, that is convenient for retrieval and
1586 machine processing and publish the location of that version.
1588 11.3. IANA Repository of TLD IDN Practices
1590 This registry is maintained by IANA at the request of ICANN, in
1591 conjunction with ICANN Guidelines for IDN use. It is not an IETF-
1592 managed registry and, while the protocol changes specified here may
1593 call for some revisions to the tables, these specifications have no
1594 effect on that registry and no IANA action is required as a result.
1596 12. Security Considerations
1598 Security on the Internet partly relies on the DNS. Thus, any change
1599 to the characteristics of the DNS can change the security of much of
1600 the Internet.
1602 Domain names are used by users to identify and connect to Internet
1603 servers. The security of the Internet is compromised if a user
1604 entering a single internationalized name is connected to different
1605 servers based on different interpretations of the internationalized
1606 domain name.
1608 When systems use local character sets other than ASCII and Unicode,
1609 this specification leaves the problem of transcoding between the
1610 local character set and Unicode up to the application or local
1611 system. If different applications (or different versions of one
1612 application) implement different transcoding rules, they could
1613 interpret the same name differently and contact different servers.
1614 This problem is not solved by security protocols like TLS that do not
1615 take local character sets into account.
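The transcoding risk can be made concrete with a short sketch. The byte string and the two character sets below are assumptions chosen for illustration, and the A-labels are produced with Python's built-in "idna" codec, which implements IDNA2003 rather than the protocol proposed here.

```python
# The same three bytes as they might arrive from a local, non-Unicode
# environment.
raw = b"b\xe4r"

# One application assumes ISO 8859-1 and sees U+00E4 (a with diaeresis).
as_latin1 = raw.decode("latin-1")

# Another assumes ISO 8859-5 and sees U+0444 (a Cyrillic letter).
as_cyrillic = raw.decode("iso8859-5")

# The two interpretations produce different A-labels and therefore
# different DNS queries.
label_latin1 = as_latin1.encode("idna")
label_cyrillic = as_cyrillic.encode("idna")
```

Nothing in the DNS or in TLS can detect the mismatch, because each application has sent a perfectly valid name.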
1617 To help prevent confusion between characters that are visually
1618 similar, it is suggested that implementations provide visual
1619 indications where a domain name contains multiple scripts. Such
1620 mechanisms can also be used to show when a name contains a mixture of
1621 simplified and traditional Chinese characters, or to distinguish zero
1622 and one from O and l. DNS zone administrators may impose restrictions
1623 (subject to the limitations identified elsewhere in this document)
1624 that try to minimize characters that have similar appearance or
1625 similar interpretations. It is worth noting that there are no
1626 comprehensive technical solutions to the problems of confusable
1627 characters. One can reduce the extent of the problems in various
1628 ways, but probably never eliminate it. Some specific suggestions
1629 about identification and handling of confusable characters appear in
1630 a Unicode Consortium publication [???].
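A user interface could flag multi-script names along the following lines. This is a rough heuristic only: Python's standard library does not expose the Unicode Script property, so the sketch takes the first word of each letter's Unicode character name as a script hint. The function names are hypothetical, and a production implementation would use real script data.

```python
import unicodedata


def script_hints(label: str) -> set:
    """Collect a crude script hint (first word of the character name)
    for each letter in the label; digits and hyphens are ignored."""
    hints = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints


def looks_mixed_script(label: str) -> bool:
    """True if the letters in the label appear to come from more than
    one script according to the heuristic above."""
    return len(script_hints(label)) > 1
```

With this heuristic, "paypal" yields a single hint, while "p\u0430ypal" (where the second letter is Cyrillic) yields two and could be highlighted to the user. As the text notes, such indications reduce the problem but cannot eliminate it.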
1632 The registration and resolution models described above and in
1634 [IDNA200X-protocol] change the mechanisms available for applications
1635 and resolvers to determine the validity of labels they encounter. In
1636 some respects, the ability to test is strengthened. For example,
1637 putative labels that contain unassigned code points will now be
1638 rejected, while IDNA2003 permitted them (something that is now
1639 recognized as a considerable source of risk). On the other hand, the
1640 protocol specification no longer assumes that the application that
1641 looks up a name will be able to determine, and apply, information
1642 about the protocol version used in registration. In theory, that may
1643 increase risk since the application will be able to do less pre-
1644 lookup validation. In practice, the protection afforded by that test
1645 has been largely illusory for reasons explained in RFC 4690 and
1646 above.
1648 Any change to Stringprep or, more broadly, the IETF's model of the
1649 use of internationalized character strings in different protocols,
1650 creates some risk of inadvertent changes to those protocols,
1651 invalidating deployed applications or databases, and so on. Our
1652 current hypothesis is that the same considerations that would require
1653 changing the IDN prefix (see Section 8.3.1) are the ones that would,
1654 e.g., invalidate certificates or hashes that depend on Stringprep,
1655 but those cases require careful consideration and evaluation. More
1656 important, it is not necessary to change Stringprep2003 at all in
1657 order to make the IDNA changes contemplated here. It is far
1658 preferable to create a separate document, or separate profile
1659 components, for IDN work, leaving the question of upgrading to other
1660 protocols to experts on them and eliminating any possible
1661 synchronization dependency between IDNA changes and possible upgrades
1662 to security protocols or conventions.
1664 13. Change Log
1666 [[anchor44: RFC Editor: Please remove this section.]]
1668 13.1. Version -01
1670 Version -01 of this document is a considerable rewrite from -00.
1671 Many sections have been clarified or extended and several new
1672 sections have been added to reflect discussions in a number of
1673 contexts since -00 was issued.
1675 13.2. Version -02
1677 o Corrected several editorial errors including an accidentally-
1678 introduced misstatement about NFKC.
1680 o Extensively revised the document to synchronize its terminology
1681 with version 03 of [IDNA200X-Tables] and to provide a better
1682 conceptual framework for its categories and how they are used.
1683 Added new material to clarify terminology and relationships with
1684 other efforts. More subtle changes in this version lay the
1685 groundwork for separating the document into a conceptual overview
1686 and a protocol specification for version 03.
1688 13.3. Version -03
1690 o Removed protocol materials to a separate document and incorporated
1691 rationale and explanation materials from the original
1692 specification in RFC 3490 into this document. Cleaned up earlier
1693 text to reflect a more mature specification and restructured
1694 several sections and added further rationale material.
1696 o Strengthened and clarified the A-label / U-label / LDH-label
1697 definition.
1699 o Retitled the document to reflect its evolving role.
1701 13.4. Version -04
1703 o Moved more text from "protocol" and further reorganized material.
1705 o Provided new material on "Contextual Rule Required".
1707 o Improved consistency of terminology, both internally and with the
1708 "tables" document.
1710 o Improved the IANA Considerations section and discussed the
1711 existing IDNA-related registry.
1713 o More small changes to increase consistency.
1715 13.5. Version -05
1717 Changed "YES" category back to "ALWAYS" to re-synch with the tables
1718 document and provide clearer terminology.
1720 14. References
1722 14.1. Normative References
1724 [ASCII] American National Standards Institute (formerly United
1725 States of America Standards Institute), "USA Code for
1726 Information Interchange", ANSI X3.4-1968, 1968.
1728 ANSI X3.4-1968 has been replaced by newer versions with
1729 slight modifications, but the 1968 version remains
1730 definitive for the Internet.
1732 [IDNA200X-Bidi]
1733 Alvestrand, H. and C. Karp, "An IDNA problem in right-to-
1734 left scripts", July 2007, .
1737 [IDNA200X-Tables]
1738 Faltstrom, P., "The Unicode Codepoints and IDN",
1739 November 2007, .
1742 A version of this document is available in HTML format at
1743 http://stupid.domain.name/idnabis/
1744 draft-faltstrom-idnabis-tables-03.txt
1746 [IDNA200X-protocol]
1747 Klensin, J., "Internationalizing Domain Names in
1748 Applications (IDNA): Protocol", November 2007, .
1752 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1753 Requirement Levels", BCP 14, RFC 2119, March 1997.
1755 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
1756 Internationalized Strings ("stringprep")", RFC 3454,
1757 December 2002.
1759 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
1760 "Internationalizing Domain Names in Applications (IDNA)",
1761 RFC 3490, March 2003.
1763 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1764 Profile for Internationalized Domain Names (IDN)",
1765 RFC 3491, March 2003.
1767 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
1768 for Internationalized Domain Names in Applications
1769 (IDNA)", RFC 3492, March 2003.
1771 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
1772 Engineering Team (JET) Guidelines for Internationalized
1773 Domain Names (IDN) Registration and Administration for
1774 Chinese, Japanese, and Korean", RFC 3743, April 2004.
1776 [RFC4290] Klensin, J., "Suggested Practices for Registration of
1777 Internationalized Domain Names (IDN)", RFC 4290,
1778 December 2005.
1780 [Unicode-UAX15]
1781 The Unicode Consortium, "Unicode Standard Annex #15:
1782 Unicode Normalization Forms", 2006,
1783 .
1785 [Unicode32]
1786 The Unicode Consortium, "The Unicode Standard, Version
1787 3.0", 2000.
1789 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5).
1790 Version 3.2 consists of the definition in that book as
1791 amended by the Unicode Standard Annex #27: Unicode 3.1
1792 (http://www.unicode.org/reports/tr27/) and by the Unicode
1793 Standard Annex #28: Unicode 3.2
1794 (http://www.unicode.org/reports/tr28/).
1796 [Unicode40]
1797 The Unicode Consortium, "The Unicode Standard, Version
1798 4.0", 2003.
1800 [Unicode50]
1801 The Unicode Consortium, "The Unicode Standard, Version
1802 5.0", 2007.
1804 Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0
1806 14.2. Informative References
1808 [ICANN-Guidelines]
1809 ICANN, "IDN Implementation Guidelines", 2006,
1810 .
1812 [RFC0810] Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
1813 Internet host table specification", RFC 810, March 1982.
1815 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
1816 STD 13, RFC 1034, November 1987.
1818 [RFC1035] Mockapetris, P., "Domain names - implementation and
1819 specification", STD 13, RFC 1035, November 1987.
1821 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
1822 and Support", STD 3, RFC 1123, October 1989.
1824 [RFC2782] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
1825 specifying the location of services (DNS SRV)", RFC 2782,
1826 February 2000.
1828 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
1829 Resource Identifier (URI): Generic Syntax", STD 66,
1830 RFC 3986, January 2005.
1832 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
1833 Identifiers (IRIs)", RFC 3987, January 2005.
1835 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
1836 Recommendations for Internationalized Domain Names
1837 (IDNs)", RFC 4690, September 2006.
1839 Author's Address
1841 John C Klensin (editor)
1842 1770 Massachusetts Ave, Ste 322
1843 Cambridge, MA 02140
1844 USA
1846 Phone: +1 617 245 1457
1848 Email: john+ietf@jck.com
1851 Full Copyright Statement
1853 Copyright (C) The IETF Trust (2007).
1855 This document is subject to the rights, licenses and restrictions
1856 contained in BCP 78, and except as set forth therein, the authors
1857 retain all their rights.
1859 This document and the information contained herein are provided on an
1860 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1861 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
1862 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
1863 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
1864 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1865 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1867 Intellectual Property
1869 The IETF takes no position regarding the validity or scope of any
1870 Intellectual Property Rights or other rights that might be claimed to
1871 pertain to the implementation or use of the technology described in
1872 this document or the extent to which any license under such rights
1873 might or might not be available; nor does it represent that it has
1874 made any independent effort to identify any such rights. Information
1875 on the procedures with respect to rights in RFC documents can be
1876 found in BCP 78 and BCP 79.
1878 Copies of IPR disclosures made to the IETF Secretariat and any
1879 assurances of licenses to be made available, or the result of an
1880 attempt made to obtain a general license or permission for the use of
1881 such proprietary rights by implementers or users of this
1882 specification can be obtained from the IETF on-line IPR repository at
1883 http://www.ietf.org/ipr.
1885 The IETF invites any interested party to bring to its attention any
1886 copyrights, patents or patent applications, or other proprietary
1887 rights that may cover technology that may be required to implement
1888 this standard. Please address the information to the IETF at
1889 ietf-ipr@ietf.org.
1891 Acknowledgment
1893 Funding for the RFC Editor function is provided by the IETF
1894 Administrative Support Activity (IASA).