2 Network Working Group J. Klensin
3 Internet-Draft July 12, 2008
4 Intended status: Standards Track
5 Expires: January 13, 2009
7 Internationalized Domain Names for Applications (IDNA): Definitions,
8 Background and Rationale
9 draft-ietf-idnabis-rationale-01.txt
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on January 13, 2009.
36 Abstract
38 Several years have passed since the original protocol for
39 Internationalized Domain Names (IDNs) was completed and deployed.
40 During that time, a number of issues have arisen, including the need
41 to update the system to deal with newer versions of Unicode. Some of
42 these issues require tuning of the existing protocols and the tables
43 on which they depend. This document provides an overview of a
44 revised system and provides explanatory material for its components.
46 Table of Contents
48 1. Introduction
49 1.1. Context and Overview
50 1.2. Discussion Forum
51 1.3. Objectives
52 1.4. Applicability and Function of IDNA
53 1.5. Terminology
54 1.5.1. Documents and Standards
55 1.5.2. Terminology about Characters and Character Sets
56 1.5.3. DNS-related Terminology
57 1.5.4. Terminology Specific to IDNA
58 1.5.5. Punycode is an Algorithm, not a Name
59 1.5.6. Other Terminology Issues
60 1.6. Comprehensibility of IDNA Mechanisms and Processing
61 2. Summary of Major Changes from IDNA2003
62 3. The Revised IDNA Model
63 4. Processing in IDNA2008
64 5. IDNA2008 Document List
65 6. Permitted Characters: An Inclusion List
66 6.1. A Tiered Model of Permitted Characters and Labels
67 6.1.1. PROTOCOL-VALID
68 6.1.2. DISALLOWED
69 6.1.3. UNASSIGNED
70 6.2. Registration Policy
71 6.3. Layered Restrictions: Tables, Context, Registration,
72 Applications
73 7. Issues that Constrain Possible Solutions
74 7.1. Display and Network Order
75 7.2. Entry and Display in Applications
76 7.3. Linguistic Expectations: Ligatures, Digraphs, and
77 Alternate Character Forms
78 7.4. Case Mapping and Related Issues
79 7.5. Right to Left Text
80 8. IDNs and the Robustness Principle
81 9. Front-end and User Interface Processing
82 10. Migration and Version Synchronization
83 10.1. Design Criteria
84 10.1.1. General IDNA Validity Criteria
85 10.1.2. Labels in Registration
86 10.1.3. Labels in Resolution (Lookup)
87 10.2. More Flexibility in User Agents
88 10.3. The Question of Prefix Changes
89 10.3.1. Conditions Requiring a Prefix Change
90 10.3.2. Conditions Not Requiring a Prefix Change
91 10.3.3. Implications of Prefix Changes
92 10.4. Stringprep Changes and Compatibility
93 10.5. The Symbol Question
94 10.6. Migration Between Unicode Versions: Unassigned Code
95 Points
96 10.7. Other Compatibility Issues
97 11. Acknowledgments
98 12. Contributors
99 13. IANA Considerations
100 13.1. IDNA Character Registry
101 13.2. IDNA Context Registry
102 13.3. IANA Repository of IDN Practices of TLDs
103 14. Security Considerations
104 15. Change Log
105 15.1. Version -01 of draft-klensin-idnabis-issues
106 15.2. Version -02 of draft-klensin-idnabis-issues
107 15.3. Version -03 of draft-klensin-idnabis-issues
108 15.4. Version -04 of draft-klensin-idnabis-issues
109 15.5. Version -05 of draft-klensin-idnabis-issues
110 15.6. Version -06 of draft-klensin-idnabis-issues
111 15.7. Version -07 of draft-klensin-idnabis-issues
112 15.8. Version -00 of draft-ietf-idnabis-rationale
113 15.9. Version -01 of draft-ietf-idnabis-rationale
114 16. References
115 16.1. Normative References
116 16.2. Informative References
117 Author's Address
118 Intellectual Property and Copyright Statements
120 1. Introduction
122 1.1. Context and Overview
124 Several years have passed since the original protocol for
125 Internationalized Domain Names (IDNs) was completed and deployed.
126 During that time, a number of issues have arisen, including a subset
127 of those described in a recent IAB report [RFC4690] and the need to
128 update the system to deal with newer versions of Unicode. The
129 standards that define IDNs are known as Internationalized Domain Names
130 in Applications (IDNA), taken from the name of the highest-level
131 standard within that group (see Section 1.5). Some tuning of the existing protocols and
132 the tables on which they depend is now required. Where it is
133 important to understanding of the revised protocols, this document
134 further explains the issues that have been encountered. It also
135 provides an overview of the new IDNA model and explanatory material
136 for it. Additional explanatory material for the specific components
137 of the proposals will appear with the associated documents.
139 1.2. Discussion Forum
141 [[anchor4: RFC Editor: please remove this section.]]
143 This work is being discussed in the IETF "idnabis" Working Group and
144 on the mailing list idna-update@alvestrand.no.
146 1.3. Objectives
148 The intent of the IDNA revision effort, and hence of this document
149 and the associated ones, is to increase the usability and
150 effectiveness of internationalized domain names (IDNs) while
151 preserving or strengthening the integrity of references that use
152 them. The original "hostname" character definitions (see, e.g.,
153 [RFC0810]) struck a balance between the creation of useful mnemonics
154 and the introduction of parsing problems or general confusion in the
155 contexts in which domain names are used. Our objective is to
156 preserve that balance while expanding the character repertoire to
157 include extended versions of Roman-derived scripts and scripts that
158 are not Roman in origin. No work of this sort will be able to
159 completely eliminate sources of visual or textual confusion: such
160 confusion is possible even under the original rules where only ASCII
161 characters were permitted. However, one can hope, through the
162 application of different techniques at different points (see
163 Section 6.3), to keep problems to an acceptable minimum. One
164 consequence of this general objective is that the desire of some user
165 or marketing community to use a particular string --whether the
166 reason is to try to write sentences of particular languages in the
167 DNS, to express a facsimile of the symbol for a brand, or for some
168 other purpose-- is not a primary goal within the context of
169 applications in the domain name space.
171 1.4. Applicability and Function of IDNA
173 The IDNA standard does not require any applications to conform to it,
174 nor does it retroactively change those applications. An application
175 can elect to use IDNA in order to support IDN while maintaining
176 interoperability with existing infrastructure. If an application
177 wants to use non-ASCII characters in domain names, IDNA is the only
178 currently-defined option. Adding IDNA support to an existing
179 application entails changes to the application only, and leaves room
180 for flexibility in front-end processing and more specifically in the
181 user interface (see Section 9).
183 A great deal of the discussion of IDN solutions has focused on
184 transition issues and how IDNs will work in a world where not all of
185 the components have been updated. Proposals that were not chosen by
186 the original IDN Working Group would have depended on user applications,
187 resolvers, and DNS servers being updated in order for a user to apply
188 an internationalized domain name in any form or coding acceptable
189 under that method. While processing must be performed prior to or
190 after access to the DNS, IDNA requires no changes to the DNS protocol,
191 any DNS servers, or the resolvers on users' computers.
193 The IDNA specification solves the problem of extending the repertoire
194 of characters that can be used in domain names to include a large
195 subset of the Unicode repertoire.
197 IDNA does not extend the service offered by DNS to the applications.
198 Instead, the applications (and, by implication, the users) continue
199 to see an exact-match lookup service. Either there is a single
200 exactly-matching name or there is no match. This model has served
201 the existing applications well, but it requires, with or without
202 internationalized domain names, that users know the exact spelling of
203 the domain names that are to be typed into applications such as web
204 browsers and mail user agents. The introduction of the larger
205 repertoire of characters potentially makes the set of misspellings
206 larger, especially given that in some cases the same appearance, for
207 example on a business card, might visually match several Unicode code
208 points or several sequences of code points.
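The point about one visual appearance matching several code point sequences can be illustrated with a minimal Python sketch. The helper name code_points is hypothetical and the sample characters are chosen only for illustration; nothing here is part of the IDNA specifications.

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of a string in "U+" notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Two sequences that usually render identically: a precomposed letter and
# a base letter followed by a combining accent.
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Two single characters from different scripts that look alike in many fonts.
latin_a = "a"
cyrillic_a = "\u0430"    # CYRILLIC SMALL LETTER A

print(code_points(composed), code_points(decomposed))
print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```

An exact-match lookup treats each of these strings as a different name unless some earlier step, such as normalization, has made them identical.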
210 IDNA allows the graceful introduction of IDNs not only by avoiding
211 upgrades to existing infrastructure (such as DNS servers and mail
212 transport agents), but also by allowing some rudimentary use of IDNs
213 in applications by using the ASCII representation of the non-ASCII
214 name labels. While such names are user-unfriendly to read and type,
215 and hence not optimal for user input, they allow (for instance)
216 replying to email and clicking on URLs even though the domain name
217 displayed is incomprehensible to the user. In order to allow user-
218 friendly input and output of the IDNs and acceptance of some
219 characters as equivalent to those to be processed according to the
220 protocol, the applications need to be modified to conform to this
221 specification.
223 IDNA uses the Unicode character repertoire, for continuity with
224 IDNA2003.
226 1.5. Terminology
228 1.5.1. Documents and Standards
230 This document uses the term "IDNA2003" to refer to the set of
231 standards that make up and support the version of IDNA published in
232 2003, i.e., those commonly known as the IDNA base specification
233 [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep
234 [RFC3454]. In this document, those names are used to refer,
235 conceptually, to the individual documents, with the base IDNA
236 specification called just "IDNA".
238 The term "IDNA2008" is used to refer to a new version of IDNA as
239 described in this document and in the documents described in
240 Section 5. References to "these specifications" are to the entire
241 set.
243 1.5.2. Terminology about Characters and Character Sets
245 A code point is an integer value associated with a character in a
246 coded character set.
248 Unicode [Unicode51] is a coded character set containing almost
249 100,000 characters as of the current version. A single Unicode code
250 point is denoted by "U+" followed by four to six hexadecimal digits,
251 while a range of Unicode code points is denoted by two four- to six-
252 digit hexadecimal numbers separated by "..", with no prefixes.
254 ASCII means US-ASCII [ASCII], a coded character set containing 128
255 characters associated with code points in the range 0000..007F.
256 Unicode may be thought of as an extension of ASCII; it includes all
257 the ASCII characters and associates them with equivalent code points.
259 "Letters" are, informally, generalizations from the ASCII and common-
260 sense understanding of that term, i.e., characters that are used to
261 write text that are not digits, symbols, or punctuation. Formally,
262 they are characters with a Unicode General Category value starting with
263 "L" (see Section 4.5 of [Unicode51]).
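The notation and definitions in this subsection can be sketched in Python as follows. The function names are hypothetical, and the letter test depends on the Unicode version bundled with the Python runtime, which may differ from the version cited in this document.

```python
import unicodedata


def u_plus(ch: str) -> str:
    """Denote a single code point as "U+" plus four to six hex digits."""
    return f"U+{ord(ch):04X}"


def code_point_range(first: int, last: int) -> str:
    """Denote a range as two hex numbers separated by "..", with no prefixes."""
    return f"{first:04X}..{last:04X}"


def is_ascii(ch: str) -> bool:
    """True if the character lies in the ASCII range 0000..007F."""
    return ord(ch) <= 0x7F


def is_letter(ch: str) -> bool:
    """True if the General Category value starts with "L"."""
    return unicodedata.category(ch).startswith("L")


print(u_plus("a"), code_point_range(0x0000, 0x007F), is_letter("\u00E9"))
```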
265 1.5.3. DNS-related Terminology
267 When discussing the DNS, this document generally assumes the
268 terminology used in the DNS specifications [RFC1034] [RFC1035]. The
269 terms "lookup" and "resolution" are used interchangeably and the
270 process or application component that performs DNS resolution is
271 called a "resolver". The process of placing an entry into the DNS is
272 referred to as "registration" paralleling common contemporary usage
273 in other contexts. Consequently, any DNS zone administration is
274 described as a "registry", regardless of the actual administrative
275 arrangements or level in the tree. A note about that relationship is
276 included in the text below where it seems particularly significant.
278 The term "LDH code points" is defined in this document to mean the
279 code points associated with ASCII letters, digits, and the hyphen-
280 minus; that is, U+002D, 0030..0039, 0041..005A, and 0061..007A. "LDH"
281 is an abbreviation for "letters, digits, hyphen".
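As an illustration only, the set of LDH code points listed above can be written out directly. The names LDH_CODE_POINTS and only_ldh_code_points are hypothetical.

```python
# U+002D, 0030..0039, 0041..005A, and 0061..007A, as listed in the text.
LDH_CODE_POINTS = frozenset(
    [0x002D]
    + list(range(0x0030, 0x0039 + 1))
    + list(range(0x0041, 0x005A + 1))
    + list(range(0x0061, 0x007A + 1))
)


def only_ldh_code_points(label: str) -> bool:
    """True if every character of the label is an LDH code point."""
    return all(ord(ch) in LDH_CODE_POINTS for ch in label)


print(len(LDH_CODE_POINTS), only_ldh_code_points("www-1"))
```

This checks the repertoire only. It does not check placement rules such as those that apply to the hyphen in host names.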
283 The base DNS specifications [RFC1034] [RFC1035] discuss "domain
284 names" and "host names", but many people and sections of these
285 specifications use the terms interchangeably. Further, because those
286 documents were not terribly clear, many people who are sure they know
287 the exact definitions of each of these terms disagree on the
288 definitions. This document generally uses the term "domain name".
289 When it refers to, e.g., host name syntax restrictions, it explicitly
290 cites the relevant defining documents. The remaining definitions in
291 this subsection are essentially a review.
293 A label is an individual component of a domain name. Labels are
294 usually shown separated by dots; for example, the domain name
295 "www.example.com" is composed of three labels: "www", "example", and
296 "com". (The zero-length root label described in [RFC1123], which can
297 be explicit as in "www.example.com." or implicit as in
298 "www.example.com", is not considered a label in this specification.)
299 IDNA extends the set of usable characters in labels that are text.
300 For the rest of this document, the term "label" is shorthand for
301 "text label", and "every label" means "every text label".
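A minimal sketch of the label definition above, assuming the domain name is already available as a dotted string. The function name text_labels is hypothetical, and the sketch ignores escaping and other presentation details.

```python
from typing import List


def text_labels(domain_name: str) -> List[str]:
    """Split a dotted domain name into its text labels.

    An explicit trailing dot stands for the zero-length root label, which
    is not counted as a label here, so it is dropped before splitting.
    """
    if domain_name.endswith("."):
        domain_name = domain_name[:-1]
    return domain_name.split(".")


print(text_labels("www.example.com"))
print(text_labels("www.example.com."))
```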
303 1.5.4. Terminology Specific to IDNA
305 This section defines some terminology to reduce dependence on terms
306 and definitions that have been problematic in the past.
308 1.5.4.1. Terms for IDN Label Codings
310 1.5.4.1.1. IDNA-valid strings, A-label, and U-label
312 To improve clarity, this document introduces three new terms in this
313 subsection. In the next, it defines a historical one to be slightly
314 more precise for IDNA contexts.
316 o A string is "IDNA-valid" if it meets all of the requirements of
317 these specifications for an IDNA label. IDNA-valid strings may
318 appear in either of two forms, defined immediately below. It is
319 expected that specific reference will be made to the form
320 appropriate to any context in which the distinction is important.
322 o An "A-label" is the ASCII-Compatible Encoding (ACE, see
323 Section 1.5.4.4) form of an IDNA-valid string. It must be a
324 complete label: IDNA is defined for labels, not for parts of them
325 and not for complete domain names. This means, by definition,
326 that every A-label will begin with the IDNA ACE prefix, "xn--",
327 followed by a string that is a valid output of the Punycode
328 algorithm and hence a maximum of 59 ASCII characters in length.
329 The prefix and string together must conform to all requirements
330 for a label that can be stored in the DNS including conformance to
331 the LDH ("host name") rule described in RFC 1034, RFC 1123 and
332 elsewhere.
334 o A "U-label" is an IDNA-valid string of Unicode characters,
335 including at least one non-ASCII character, expressed in a
336 standard Unicode Encoding Form, normally UTF-8 in an Internet
337 transmission context, and subject to the constraint below.
338 Conversion between valid U-labels and valid A-labels is performed
339 according to the specification in [RFC3492], adding or removing
340 the ACE prefix (see Section 1.5.4.4) as needed.
342 To be valid, U-labels and A-labels must obey an important symmetry
343 constraint. While that constraint may be tested in any of several
344 ways, an A-label must be capable of being produced by conversion from
345 a U-label and a U-label must be capable of being produced by
346 conversion from an A-label. Among other things, this implies that
347 both U-labels and A-labels must represent strings in normalized form.
348 These strings MUST contain only characters specified elsewhere in
349 this document and its companion documents, and only in the contexts
350 indicated as appropriate.
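The conversion and the symmetry constraint can be sketched with the "punycode" codec in the Python standard library, which implements the algorithm of [RFC3492] without the prefix. The function names are hypothetical. The sketch does not check which characters are permitted, so passing it shows only that the round trip and the normalization condition hold, not that a string is IDNA-valid.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Punycode-encode a U-label candidate and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Remove the ACE prefix and Punycode-decode the remainder."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("missing ACE prefix")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def is_symmetric(u_label: str) -> bool:
    """Check NFC form and that U-label -> A-label -> U-label is lossless."""
    if unicodedata.normalize("NFC", u_label) != u_label:
        return False
    return to_u_label(to_a_label(u_label)) == u_label


print(to_a_label("b\u00FCcher"))
print(is_symmetric("b\u00FCcher"), is_symmetric("bu\u0308cher"))
```

The second call to is_symmetric uses a decomposed spelling, which fails the normalization condition even though it would display the same way.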
352 Any rules or conventions that apply to DNS labels in general, such as
353 rules about lengths of strings, apply to whichever of the U-label or
354 A-label would be more restrictive. For the U-label, constraints
355 imposed by existing protocols and their presentation forms make the
356 length restriction apply to the length in octets of the UTF-8 form of
357 those labels (which will always be greater than or equal to the
358 length in code points). The exception to this, of course, is that
359 the restriction to ASCII characters does not apply to the U-label.
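One way to read the length rule is to test both forms and require that each fits. The sketch below assumes the 63-octet label limit of the DNS specifications; the function name within_length_limits is hypothetical.

```python
MAX_LABEL_OCTETS = 63  # label length limit from the DNS specifications


def within_length_limits(u_label: str) -> bool:
    """Apply the limit to the A-label and to the UTF-8 form of the U-label."""
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    utf8_octets = len(u_label.encode("utf-8"))
    return len(a_label) <= MAX_LABEL_OCTETS and utf8_octets <= MAX_LABEL_OCTETS


# The UTF-8 length is never smaller than the length in code points.
sample = "b\u00FCcher"
print(len(sample), len(sample.encode("utf-8")), within_length_limits(sample))
```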
361 A different way to look at these terms, which may be clearer to
362 some readers, is that U-labels, A-labels, and LDH-labels (see the
363 next subsection) are disjoint categories that, together, make up the
364 forms of legitimate strings for use in domain names that describe
365 hosts. Of the three, only A-labels and LDH-labels can actually
366 appear in DNS zone files or queries; U-labels can appear, along with
367 the other two, in presentation and user interface forms and in
368 selected protocols other than those of the DNS itself. Strings that
369 do not conform to the rules for one of these three categories and, in
370 particular, strings that contain "--" in the third and fourth
371 character positions but that:
373 o are not A-labels or
375 o cannot be processed as U-labels or A-labels as described in these
376 specifications,
378 are invalid in IDNA-conformant applications as labels in domain names
379 that identify Internet hosts or similar resources. This restriction
380 on strings containing "--" is required for three reasons:
382 o to prevent confusion with pre-IDNA coding forms;
384 o to permit future extensions that would require changing the
385 prefix, no matter how unlikely those might be (see Section 10.3);
386 and
388 o to reduce the opportunities for attacks via the encoding system.
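The three disjoint categories and the rule about "--" in the third and fourth positions can be sketched as a rough classifier. This is an assumption-laden illustration: the names are hypothetical, the host name rule is taken to forbid a leading or trailing hyphen, label length is not checked, and non-ASCII strings are reported only as U-label candidates because the character tables of the companion documents are not consulted.

```python
import string

ACE_PREFIX = "xn--"
LDH_CHARS = frozenset(string.ascii_letters + string.digits + "-")


def has_ldh_syntax(label: str) -> bool:
    """ASCII letters, digits, and hyphen only; no leading or trailing hyphen."""
    return (
        bool(label)
        and all(ch in LDH_CHARS for ch in label)
        and not label.startswith("-")
        and not label.endswith("-")
    )


def classify_label(label: str) -> str:
    if not label.isascii():
        return "U-label candidate"  # permitted-character checks are out of scope
    if not has_ldh_syntax(label):
        return "invalid"
    if label[2:4] != "--":
        return "LDH-label"
    lowered = label.lower()
    if not lowered.startswith(ACE_PREFIX):
        return "invalid"  # "--" in positions three and four without the ACE prefix
    try:
        decoded = lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        round_trip = ACE_PREFIX + decoded.encode("punycode").decode("ascii")
    except ValueError:
        return "invalid"
    # An A-label must decode to a string with a non-ASCII character and
    # re-encode to exactly the same A-label.
    if decoded.isascii() or round_trip != lowered:
        return "invalid"
    return "A-label"


for candidate in ["example", "xn--bcher-kva", "ab--cd", "b\u00FCcher"]:
    print(candidate, "->", classify_label(candidate))
```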
390 1.5.4.2. LDH-label and Internationalized Label
392 In the hope of further clarifying discussions about IDNs, these
393 specifications use the term "LDH-label" strictly to refer to an all-
394 ASCII label that obeys the "hostname" (LDH) conventions and that is
395 not an IDN. In other words, only "U-label" and "A-label" refer to
396 IDNs; LDH-labels are not IDNs. "Internationalized label" is used
397 when a term is needed to refer to any of the three categories. There
398 are some standardized DNS label formats, such as those for service
399 location (SRV) records [RFC2782] that do not fall into any of the
400 three categories and hence are not internationalized labels.
402 1.5.4.3. Equivalence
404 In IDNA, equivalence of labels is defined in terms of the A-labels.
405 If the A-labels are equal in a case-independent comparison, then the
406 labels are considered equivalent, no matter how they are represented.
407 Traditional LDH labels already have a notion of equivalence: within
408 that list of characters, upper case and lower case are considered
409 equivalent. The IDNA notion of equivalence is an extension of that
410 older notion. Equivalent labels in IDNA are treated as alternate
411 forms of the same label, just as "foo" and "Foo" are treated as
412 alternate forms of the same label.
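Equivalence as described above can be sketched by reducing each label to its ASCII form and comparing without regard to case. The function names are hypothetical, and the inputs are assumed to be labels that are already valid.

```python
def ascii_form(label: str) -> str:
    """Return the A-label for a U-label; return all-ASCII labels unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def equivalent(first: str, second: str) -> bool:
    """Compare the ASCII forms of two labels in a case-independent way."""
    return ascii_form(first).lower() == ascii_form(second).lower()


print(equivalent("foo", "Foo"))
print(equivalent("b\u00FCcher", "XN--BCHER-KVA"))
```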
414 1.5.4.4. ACE Prefix
416 The "ACE prefix" is defined in this document to be a string of ASCII
417 characters "xn--" that appears at the beginning of every A-label.
418 "ACE" stands for "ASCII-Compatible Encoding".
420 1.5.4.5. Domain Name Slot
422 A "domain name slot" is defined in this document to be a protocol
423 element or a function argument or a return value (and so on)
424 explicitly designated for carrying a domain name. Examples of domain
425 name slots include: the QNAME field of a DNS query; the name argument
426 of the gethostbyname() or getaddrinfo() standard C library functions;
427 the part of an email address following the at-sign (@) in the
428 parameter to the SMTP MAIL or RCPT commands or the "From:" field of
429 an email message header; and the host portion of the URI in the src
430 attribute of an HTML tag. General text that just happens to
431 contain a domain name is not a domain name slot. For example, a
432 domain name appearing in the plain text body of an email message is
433 not occupying a domain name slot.
435 An "IDN-aware domain name slot" is defined in this document to be a
436 domain name slot explicitly designated for carrying an
437 internationalized domain name as defined in this document. The
438 designation may be static (for example, in the specification of the
439 protocol or interface) or dynamic (for example, as a result of
440 negotiation in an interactive session).
442 An "IDN-unaware domain name slot" is defined in this document to be
443 any domain name slot that is not an IDN-aware domain name slot.
444 Obviously, this includes any domain name slot whose specification
445 predates IDNA.
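Two of the domain name slots mentioned above can be illustrated by pulling the domain name out of its surrounding protocol element. The function names are hypothetical and the parsing is deliberately simplistic; real address and URI parsing has more cases than this sketch handles.

```python
from urllib.parse import urlsplit


def host_slot_of_uri(uri: str):
    """Return the host portion of a URI, one example of a domain name slot."""
    return urlsplit(uri).hostname


def domain_slot_of_address(address: str) -> str:
    """Return the part of an email address after the last at-sign."""
    return address.rpartition("@")[2]


print(host_slot_of_uri("http://www.example.com/logo.png"))
print(domain_slot_of_address("user@example.org"))
```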
447 1.5.5. Punycode is an Algorithm, not a Name
449 There has been some confusion about whether a "Punycode string" does
450 or does not include the prefix and about whether it is required that
451 such strings could have been the output of ToASCII (see RFC 3490,
452 Section 4 [RFC3490]). This specification discourages the use of the
453 term "Punycode" to describe anything but the encoding method and
454 algorithm of [RFC3492]. The terms defined above are preferred as
455 much clearer than terms such as "Punycode string".
457 1.5.6. Other Terminology Issues
459 The document departs from historical DNS terminology and usage in one
460 important respect. Over the years, the community has talked very
461 casually about "names" in the DNS, beginning with calling it "the
462 domain name system". That terminology is fine in the very precise
463 sense that the identifiers of the DNS do provide names for objects
464 and addresses. But, in the context of IDNs, the term has introduced
465 some confusion, confusion that has increased further as people have
466 begun to speak of DNS labels in terms of the words or phrases of
467 various natural languages.
469 Historically, many, perhaps most, of the "names" in the DNS have been
470 mnemonics to identify some particular concept, object, or
471 organization. They are typically derived from, or rooted in, some
472 language because most people think in language-based ways. But,
473 because they are mnemonics, they need not obey the orthographic
474 conventions of any language: it is not a requirement that it be
475 possible for them to be "words".
477 This distinction is important because the reasonable goal of an IDN
478 effort is not to be able to write the great Klingon (or language of
479 one's choice) novel in DNS labels but to be able to form a usefully
480 broad range of mnemonics in ways that are as natural as possible in a
481 very broad range of scripts.
483 An "internationalized domain name" (IDN) is a domain name that may
484 contain any mixture of LDH-labels, A-labels, or U-labels. This
485 implies that every conventional domain name is an IDN (which implies
486 that it is possible for a domain name to be an IDN without it
487 containing any non-ASCII characters). Just as has been the case with
488 ASCII names, some DNS zone administrators may impose restrictions,
489 beyond those imposed by DNS or IDNA, on the characters or strings
490 that may be registered as labels in their zones. Because of the
491 diversity of characters that can be used in a U-label and the
492 confusion they might cause, such restrictions are mandatory for IDN
493 registries and zones even though the particular restrictions are not
494 part of these specifications. Because these restrictions, commonly
495 known as "registry restrictions", only affect what can be registered
496 and not resolution processing, they have no effect on the syntax or
497 semantics of DNS protocol messages; a query for a name that matches
498 no records will yield the same response regardless of the reason why
499 it is not in the zone. Clients issuing queries or interpreting
500 responses cannot be assumed to have any knowledge of zone-specific
501 restrictions or conventions. See Section 6.2.
503 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
504 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
505 document are to be interpreted as described in RFC 2119 [RFC2119].
507 1.6. Comprehensibility of IDNA Mechanisms and Processing
509 One of the major goals of this work is to improve the general
510 understanding of how IDNA works and what characters are permitted and
511 what happens to them. Comprehensibility and predictability to users
512 and registrants are themselves important motivations and design goals
513 for this effort. The effort includes some new terminology and a
514 revised and extended model, both covered in this section, and some
515 more specific protocol, processing, and table modifications. Details
516 of the latter appear in other documents (see Section 5).
518 Several issues are inherent in the application of IDNs and, indeed,
519 almost any other system that tries to handle international characters
520 and concepts. They range from the apparently trivial --e.g., one
521 cannot display a character for which one does not have a font
522 available locally-- to the more complex and subtle. Many people have
523 observed that internationalization is just a tool to enable effective
524 localization while permitting some global uniformity. Issues of
525 display, of exactly how various strings and characters are entered,
526 and so on are inherently issues about localization and user interface
527 design.
529 A protocol such as IDNA can only assume that such operations as data
530 entry and reconciliation of differences in character forms are
531 possible. It may make some recommendations about how display might
532 work when characters and fonts are not available, but they can only
533 be general recommendations and, because display functions are rarely
534 controlled by the types of applications that would call upon IDNA,
535 will rarely be very effective.
537 However, shifting responsibility for character mapping and other
538 adjustments from the protocol (where it was located in IDNA2003) to
539 the user interface or processing before invoking IDNA raises issues
540 about both what that processing should do and about compatibility for
541 references prepared in an IDNA2003 context. Those issues are
542 discussed in Section 9.
544 Operations for converting between local character sets and normalized
545 Unicode are part of this general set of user interface issues. The
546 conversion is obviously not required at all in a Unicode-native
547 system that maintains all strings in Normalization Form C (NFC). It
548 may, however, involve some complexity in a system that is not
549 Unicode-native, especially if the elements of the local character set
550 do not map exactly and unambiguously into Unicode characters or do so
551 in a way that is not completely stable over time. Perhaps more
552 important, if a label being converted to a local character set
553 contains Unicode characters that have no correspondence in that
554 character set, the application may have to apply special, locally-
555 appropriate, methods to avoid or reduce loss of information.
557 Depending on the system involved, the major difficulty may not lie in
558 the mapping but in accurately identifying the incoming character set
559 and then applying the correct conversion routine. If a local
560 operating system uses one of the ISO 8859 character sets or an
561 extensive national or industrial system such as GB18030 [GB18030] or
562 BIG5 [BIG5], one must correctly identify the character set in use
563 before converting to Unicode even though those character coding
564 systems are substantially or completely Unicode-compatible (i.e., all
565 of the code points in them have an exact and unique mapping to
566 Unicode code points). It may be even more difficult when the
567 character coding system in local use is based on conceptually
568 different assumptions than those used by Unicode, e.g., about
569 font encodings used for publications in some Indic scripts. Those
570 differences may not easily yield unambiguous conversions or
571 interpretations even if each coding system is internally consistent
572 and adequate to represent the local language and script.
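The identify-then-convert step described above can be sketched as follows, using only the Python standard library. The function name to_nfc_unicode and the sample bytes are illustrative assumptions, not part of IDNA. The caller is assumed to already know the source character set; the sketch makes no attempt to detect it.

```python
import unicodedata

def to_nfc_unicode(raw: bytes, charset: str) -> str:
    """Decode bytes in a known local charset, then normalize to NFC."""
    text = raw.decode(charset)  # raises UnicodeDecodeError on invalid input
    return unicodedata.normalize("NFC", text)

# The same byte means different things under different charsets, which is
# why the charset has to be identified before conversion.
raw = b"\xe4"
as_latin1 = to_nfc_unicode(raw, "iso-8859-1")    # LATIN SMALL LETTER A WITH DIAERESIS
as_cyrillic = to_nfc_unicode(raw, "iso-8859-5")  # a Cyrillic letter

# A decomposed sequence (a + combining diaeresis) is composed by NFC.
composed = to_nfc_unicode("a\u0308".encode("utf-8"), "utf-8")
```

Choosing the wrong charset in this sketch does not raise an error; it silently yields a different character, which is the failure mode the paragraph above warns about.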
574 2. Summary of Major Changes from IDNA2003
576 1. Update the base character set from Unicode 3.2 to a definition
577 that is Unicode version-agnostic.
579 2. Separate the definitions for the "registration" and "lookup"
580 activities.
582 3. Disallow symbol and punctuation characters except where special
583 exceptions are necessary.
585 4. Remove the mapping and normalization steps from the protocol and
586 have them instead done by the applications themselves, possibly
587 in a local fashion, before invoking the protocol.
589 5. Change the way that the protocol specifies which characters are
590 allowed in labels from "humans decide what the table of
591 codepoints contains" to "decisions about codepoints are based on
592 Unicode properties plus a small exclusion list created by
593 humans".
595 6. Introduce the new concept of characters that can be used only in
596 specific contexts.
598 7. Allow typical words and names in languages such as Dhivehi and
599 Yiddish to be expressed.
601 8. Make bidirectional domain names (delimited strings of labels,
602 not just labels standing on their own) display in a non-
603 surprising fashion.
605 9. Make bidirectional domain names in a paragraph display in a non-
606 surprising fashion. [[anchor17: Is this statement necessary or is
607 it redundant with the previous one?]]
609 10. Remove the dot separator from the mandatory part of the
610 protocol.
612 11. Make some currently-valid labels that are not actually IDNA
613 labels invalid.
615 3. The Revised IDNA Model
617 IDNA is a client-side protocol, i.e., almost all of the processing is
618 performed by the client. The strings that appear in, and are
619 resolved by, the DNS conform to the traditional rules for the naming
620 of hosts, and consist of ASCII letters, digits, and hyphens. This
621 approach permits IDNA to be deployed without modifications to the DNS
622 itself. That, in turn, avoids both having to upgrade the entire
623 Internet to support IDNs and needing to incur the unknown risks to
624 deployed systems of DNS structural or design changes, especially if
625 those changes need to be deployed all at the same time.
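As a rough illustration of why no DNS change is needed, the client can turn a non-ASCII label into a string of ASCII letters, digits, and hyphens before any query is issued. The sketch below uses Python's built-in "punycode" codec and the "xn--" ACE prefix defined for IDNA. The helper names to_a_label and to_u_label are assumptions made for this example, and the code performs none of the validity checks that the protocol documents require.

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Convert a U-label to its ASCII-compatible form (no validation)."""
    if u_label.isascii():
        return u_label  # already an ordinary ASCII label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Convert an A-label back to Unicode; other labels pass through."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

a_label = to_a_label("b\u00fccher")   # label containing u-with-diaeresis
round_trip = to_u_label(a_label)
```

Only the A-label form ever reaches the DNS, so servers and resolvers see nothing but a conventional host-name label.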
627 4. Processing in IDNA2008
629 These specifications separate Domain Name Registration and Resolution
630 in the protocol specification. Doing so reflects current practice in
631 which per-registry restrictions and special processing are applied at
632 registration time but not on resolution. Even more important in the
633 longer term, it facilitates incremental addition of permitted
634 character groups to avoid freezing on one particular version of
635 Unicode.
637 The actual registration and lookup protocols for IDNA2008 are
638 specified in [IDNA2008-Protocol].
640 5. IDNA2008 Document List
642 [[anchor19: This section will need to be extensively revised or
643 removed before publication.]]
644 The following documents are being produced as part of the IDNA2008
645 effort.
647 o A revised version of this document, containing an overview,
648 rationale, and conformance conditions.
650 o A separate document, drawn from material in early versions of this
651 one, that explicitly updates and replaces RFC 3490 but which has
652 most rationale material from that document moved to this one
653 [IDNA2008-Protocol].
655 o A document describing the "Bidi problem" with Stringprep and
656 proposing a solution [IDNA2008-Bidi].
658 o A specification of the categories and rules that identify the code
659 points allowed in a U-label, based on Unicode 5.0 code
660 assignments. See Section 6 and [IDNA2008-Tables].
662 o One or more documents containing guidance and suggestions for
663 registries (in this context, those responsible for establishing
664 policies for any zone file in the DNS, not only those at the top
665 or second level). The documents in this category may not be IETF
666 products and may be prepared and completed asynchronously with
667 those described above.
669 6. Permitted Characters: An Inclusion List
671 This section provides an overview of the model used to establish the
672 algorithm and character lists of [IDNA2008-Tables] and describes the
673 names and applicability of the categories used there. Note that the
674 inclusion of a character in the first category group does not imply
675 that it can be used indiscriminately; some characters are associated
676 with contextual rules that must be applied as well.
678 The information given in this section is provided to make the rules,
679 tables, and protocol easier to understand. It is not normative. The
680 normative generating rules appear in [IDNA2008-Tables] and the rules
681 that actually determine what labels can be registered or looked up
682 are in [IDNA2008-Protocol].
684 6.1. A Tiered Model of Permitted Characters and Labels
686 Moving to an inclusion model requires respecifying the list of
687 characters that are permitted in IDNs. In IDNA2003, the role and
688 utility of characters are independent of context and fixed forever
689 (or until the standard is replaced). Making completely context-
690 independent rules globally has proven impractical because some
691 characters, especially those that are called "Join_Controls" in
692 Unicode, are needed to make reasonable use of some scripts but have
693 no visible effect(s) in others. Of necessity, IDNA2003 prohibited
694 those types of characters entirely. But the restrictions were much
695 too severe to permit an adequate range of mnemonics for terminology
696 based on some languages. The requirement to support those characters
697 but limit their use to very specific contexts was reinforced by the
698 observation that handling of particular characters across the
699 languages that use a script, or the use of similar or identical-
700 looking characters in different scripts, is less well understood than
701 many people believed it was several years ago.
703 Independently of the characters chosen (see next subsection), the
704 theory is to divide the characters that appear in Unicode into three
705 categories:
707 6.1.1. PROTOCOL-VALID
709 Characters identified as "PROTOCOL-VALID" (often abbreviated
710 "PVALID") are, in general, permitted by IDNA for all uses in IDNs.
711 Their use may be restricted by rules about the context in which they
712 appear or by other rules that apply to the entire label in which they
713 are to be embedded. For example, any label that contains a character
714 in this group that has a "right to left" property must be used in
715 context with the "Bidi" rules (see [IDNA2008-Bidi]).
717 The term "PROTOCOL-VALID" is used to stress the fact that the
718 presence of a character in this category does not imply that a given
719 registry need accept registrations containing any of the characters
720 in the category. Registries are still expected to apply judgment
721 about labels they will accept and to maintain rules consistent with
722 those judgments (see [IDNA2008-Protocol] and Section 6.3).
724 Characters that are placed in the "PROTOCOL-VALID" category are never
725 removed from it unless the code points themselves are removed from
726 Unicode (such removal would be inconsistent with the Unicode
727 stability principles (see [Unicode51], Appendix F) and hence should
728 never occur).
730 [[anchor21: Placeholder: Does this topic or comment need additional
731 discussion or explanation?]]
733 6.1.1.1. Contextual Rules
735 Some characters may be unsuitable for general use in IDNs but
736 necessary for the plausible support of some scripts. The two most
737 commonly-cited examples are the zero-width joiner and non-joiner
738 characters (ZWNJ, U+200C, and ZWJ, U+200D), but provisions for
739 unambiguous labels may require that other characters be restricted to
740 particular contexts. For example, the ASCII hyphen is not permitted
741 to start or end a label, whether that label contains non-ASCII
742 characters or not.
744 These characters must not appear in IDNs without additional
745 restrictions, typically because they have no visible consequences in
746 most scripts but affect format or presentation in a few others or
747 because they are combining characters that are safe for use only in
748 conjunction with particular characters or scripts. In order to
749 permit them to be used at all, they are specially identified as
750 "CONTEXTUAL RULE REQUIRED" and, when adequately understood,
751 associated with a rule. In addition, the rule will define whether it
752 is to be applied on lookup as well as registration. A distinction is
753 made between characters that indicate or prohibit joining (known as
754 "CONTEXT-JOINER" or "CONTEXTJ") and other characters requiring
755 contextual treatment ("CONTEXT-OTHER" or "CONTEXTO"). Only the
756 former are fully tested at lookup time.
758 6.1.1.2. Rules and Their Application
760 The actual rules may be present or absent. If present, they may have
761 values of "True" (character may be used in any position in any
762 label), "False" (character may not be used in any label), or may be
763 an extended regular expression that specifies the context in which
764 the character is permitted.
766 Examples of descriptions of typical rules, stated informally and in
767 English, include "Must follow a character from Script XYZ", "Must
768 occur only if the entire label is in Script ABC", and "Must occur
769 only if the previous and subsequent characters have the DFG property".
771 Because it is easier to identify these characters than to know that
772 they are actually needed in IDNs or how to establish exactly the
773 right rules for each one, a rule may have a null value in a given
774 version of the tables. Characters associated with null rules MUST
775 NOT appear in putative labels for either registration or lookup. Of
776 course, a later version of the tables might contain a non-null rule.
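The four possible rule values (True, False, a regular expression, or null) could be applied along the following lines. This is a sketch under stated assumptions only: the code points and the rules attached to them are hypothetical and are not taken from the IDNA2008 tables, and because the regular expression language is not yet defined, the sketch assumes that a rule is matched against the character together with its immediate neighbours.

```python
import re

# Hypothetical rule table, for illustration only.
RULES = {
    0x30FB: True,                     # permitted in any position
    0x200C: False,                    # never permitted
    0x200D: None,                     # null rule: no rule defined yet
    0x00B7: re.compile("l\u00b7l"),   # only between two "l" characters
}

def contextual_ok(label: str, rules=RULES) -> bool:
    """Return True if every rule-bearing character satisfies its rule."""
    for i, ch in enumerate(label):
        if ord(ch) not in rules:
            continue  # character is not subject to a contextual rule
        rule = rules[ord(ch)]
        if rule is True:
            continue
        if rule is False or rule is None:
            return False  # forbidden outright, or rule still undefined
        # Regex rule: match the character plus its two neighbours.
        context = label[max(0, i - 1):i + 2]
        if not rule.fullmatch(context):
            return False
    return True
```

In this sketch a null rule and a "False" rule both reject the label; they differ only in that a later table version might replace the null with a real rule.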
778 [[anchor23: Definition of regular expression language to be supplied
779 or replaced with a description of the definitional technique. It may
780 be useful to move more of this material to Tables as part of moving
781 the rules from Protocol to Tables.]]
783 6.1.2. DISALLOWED
785 Some characters are sufficiently problematic for use in IDNs that
786 they should be excluded for both registration and lookup (i.e.,
787 conforming applications performing name resolution should verify that
788 these characters are absent; if they are present, the label strings
789 should be rejected rather than converted to A-labels and looked up).
791 Of course, this category would include code points that had been
792 removed entirely from Unicode should such removals ever occur.
794 Characters that are placed in the "DISALLOWED" category are expected
795 to never be removed from it or reclassified. If a character is
796 classified as "DISALLOWED" in error and the error is sufficiently
797 problematic, the only recourse would be either to introduce a new
798 code point into Unicode and classify it as "PROTOCOL-VALID" or for
799 the IETF to accept the considerable costs of an incompatible change
800 and replace the relevant RFC with one containing appropriate
801 exceptions.
803 [[anchor24: Note in Draft: the permanence of DISALLOWED was still
804 under discussion in the WG when this draft was posted. The text
805 above reflects the editor's opinion about the emerging consensus but
806 is subject to change as the discussion continues.]]
808 There is provision for exception cases but, in general, characters
809 are placed into "DISALLOWED" if they fall into one or more of the
810 following groups:
812 o The character is a compatibility equivalent for another character.
813 In slightly more precise Unicode terms, application of
814 normalization method NFKC to the character yields some other
815 character.
817 o The character is an upper-case form or some other form that is
818 mapped to another character by Unicode casefolding.
820 o The character is a symbol or punctuation form or, more generally,
821 something that is not a letter, digit, or a mark that is used to
822 form a letter or digit.
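The three groups above can be approximated with Python's standard unicodedata module. This is a rough sketch, not the normative algorithm: the function name and reason strings are invented for the example, the general-category test (L*, M*, or Nd) is only an approximation of "letter, digit, or mark", and the results depend on the Unicode version built into the Python runtime. Exception cases are not modelled.

```python
import unicodedata

def disallowed_reasons(ch: str) -> list:
    """List which of the three informal DISALLOWED groups apply to ch."""
    reasons = []
    # Group 1: NFKC maps the character to something else.
    if unicodedata.normalize("NFKC", ch) != ch:
        reasons.append("nfkc")
    # Group 2: case folding maps the character to something else.
    if ch.casefold() != ch:
        reasons.append("casefold")
    # Group 3: not a letter, a decimal digit, or a combining mark.
    cat = unicodedata.category(ch)
    if not (cat[0] in "LM" or cat == "Nd"):
        reasons.append("not-letter-digit-mark")
    return reasons
```

For example, an upper-case "A" is flagged only by the case folding test, while "!" is flagged only by the category test.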
824 6.1.3. UNASSIGNED
826 For convenience in processing and table-building, code points that do
827 not have assigned values in a given version of Unicode are treated as
828 belonging to a special UNASSIGNED category. Such code points MUST
829 NOT appear in labels to be registered or looked up. The category
830 differs from DISALLOWED in that code points are moved out of it by
831 the simple expedient of being assigned in a later version of Unicode
832 (at which point, they are classified into one of the other categories
833 as appropriate).
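A minimal sketch of the UNASSIGNED test, assuming the Unicode data shipped with the Python runtime stands in for "a given version of Unicode". The helper names are illustrative. Note that general category "Cn" also covers noncharacter code points, so this is a slightly broader test than "not yet assigned".

```python
import unicodedata

def is_unassigned(cp: int) -> bool:
    """True if the code point has General_Category Cn in this runtime."""
    return unicodedata.category(chr(cp)) == "Cn"

def contains_unassigned(label: str) -> bool:
    """True if any code point in the label is unassigned."""
    return any(is_unassigned(ord(ch)) for ch in label)

# The answer is relative to this Unicode version; a later version may
# assign a code point that is reported as unassigned here.
unicode_version = unicodedata.unidata_version
```

Because the result changes with the Unicode version, two implementations built against different versions can disagree about the same label, which is why such code points are excluded from both registration and lookup.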
835 6.2. Registration Policy
837 While these recommendations cannot and should not define registry
838 policies, registries SHOULD develop and apply additional restrictions
839 to reduce confusion and other problems. For example, it is generally
840 believed that labels containing characters from more than one script
841 are a bad practice although there may be some important exceptions to
842 that principle. Some registries may choose to restrict registrations
843 to characters drawn from a very small number of scripts. For many
844 scripts, the use of variant techniques such as those described in
845 [RFC3743] and [RFC4290], and illustrated for Chinese by the tables
846 described in RFC 4713 [RFC4713] may be helpful in reducing problems
847 that might be perceived by users. It is worth stressing that these
848 principles of policy development and application apply at all levels
849 of the DNS, not only to, e.g., TLD registrations.
851 6.3. Layered Restrictions: Tables, Context, Registration, Applications
853 The essence of the character rules in IDNA2008 is based on the
854 realization that there is no magic bullet for any of the issues
855 associated with a multiscript DNS. Instead, the specifications
856 define a variety of approaches that, together, constitute multiple
857 lines of defense against ambiguity in identifiers and loss of
858 referential integrity. The actual character tables are the first
859 mechanism, protocol rules about how those characters are applied or
860 restricted in context are the second, and those two in combination
861 constitute the limits of what can be done by a protocol alone. As
862 discussed in the previous section (Section 6.2), registries are
863 expected to restrict what they permit to be registered, devising and
864 using rules that are designed to optimize the balance between
865 confusion and risk on the one hand and maximum expressiveness in
866 mnemonics on the other.
868 In addition, there is an important role for user agents in warning
869 against label forms that appear unreasonable given their knowledge of
870 local contexts and conventions. Of course, no approach based on
871 naming or identifiers alone can protect against all threats.
872 [[anchor25: Note in Draft: the last sentence above basically
873 duplicates a comment in Security Considerations. Is it worth having
874 in both places??]]
876 7. Issues that Constrain Possible Solutions
878 7.1. Display and Network Order
880 The correct treatment of domain names requires a clear distinction
881 between Network Order (the order in which the code points are sent in
882 protocols) and Display Order (the order in which the code points are
883 displayed on a screen or paper). The order of labels in a domain
884 name that contains characters that are normally written right to left
885 is discussed in [IDNA2008-Bidi]. In particular, there are questions
886 about the order in which labels are displayed if left to right and
887 right to left labels are adjacent to each other, especially if there
888 are also multiple consecutive appearances of one of the types. The
889 decision about the display order is ultimately under the control of
890 user agents --including web browsers, mail clients, and the like--
891 which may be highly localized. Even when formats are specified by
892 protocols, the full composition of an Internationalized Resource
893 Identifier (IRI) [RFC3987] or Internationalized Email address
894 contains elements other than the domain name. For example, IRIs
895 contain protocol identifiers and field delimiter syntax such as
896 "http://" or "mailto:" while email addresses contain the "@" to
897 separate local parts from domain names. User agents are not required
898 to use those protocol-based forms directly but often do so. While
899 display, parsing, and processing within a label is specified by the
900 IDNA protocol and the associated documents, the relationship between
901 fully-qualified domain names and internationalized labels is
902 unchanged from the base DNS specifications. Comments here about such
903 full domain names are explanatory or examples of what might be done
904 and must not be considered normative.
906 Questions remain about protocol constraints implying that the overall
907 direction of these strings will always be left to right (or right to
908 left) for an IRI or email address, or if they even should conform to
909 such rules. These questions also have several possible answers.
910 Should a domain name abc.def, in which both labels are represented in
911 scripts that are written right to left, be displayed as fed.cba or
912 cba.fed? An IRI for clear text web access would, in network order,
913 begin with "http://" and the characters will appear as
914 "http://abc.def" -- but what does this suggest about the display
915 order? When entering a URI into many browsers, it may be possible to
916 provide only the domain name and leave the "http://" to be filled in
917 by default, assuming no tail (an approach that does not work for
918 other protocols). The natural display order for the typed domain
919 name on a right to left system is fed.cba. Does this change if a
920 protocol identifier, tail, and the corresponding delimiters are
921 specified?
923 While logic, precedent, and reality suggest that these are questions
924 for user interface design, not IETF protocol specifications,
925 experience in the 1980s and 1990s with mixing systems in which domain
926 name labels were read in network order (left to right) and those in
927 which those labels were read right to left would predict a great deal
928 of confusion, and heuristics that sometimes fail, if each
929 implementation of each application makes its own decisions on these
930 issues.
932 It should be obvious that any revision of IDNA, including the current
933 one, must be clear about the network (transmission on the wire) order
934 of characters in labels and for the labels in complete (fully-
935 qualified) domain names. In order to prevent user confusion and, in
936 particular, to reduce the chances for inconsistent transcription of
937 domain names from printed form, it is likely that some strong
938 suggestions should be made about display order as well.
940 7.2. Entry and Display in Applications
942 Applications can accept domain names using any character set or sets
943 desired by the application developer or specified by the operating
944 system, and can display domain names in any charset. That is, the
945 IDNA protocol does not affect the interface between users and
946 applications.
948 An IDNA-aware application can accept and display internationalized
949 domain names in two formats: the internationalized character set(s)
950 supported by the application (i.e., an appropriate local
951 representation of a U-label), and as an A-label. Applications MAY
952 allow the display and user input of A-labels, but are encouraged to
953 not do so except as an interface for special purposes, possibly for
954 debugging, or to cope with display limitations. A-labels are opaque
955 and ugly and, where possible, should thus be exposed to users only
956 in contexts in which they are absolutely needed. Because IDN
957 labels can be rendered either as the A-labels or U-labels, the
958 application may reasonably have an option for the user to select the
959 preferred method of display; if it does, rendering the U-label should
960 normally be the default.
962 Domain names are often stored and transported in many places. For
963 example, they are part of documents such as mail messages and web
964 pages. They are transported in many parts of many protocols, such as
965 both the control commands and the RFC 2822 body parts of SMTP, and
966 the headers and the body content in HTTP. It is important to
967 remember that domain names appear both in domain name slots and in
968 the content that is passed over protocols.
970 In protocols and document formats that define how to handle
971 specification or negotiation of charsets, labels can be encoded in
972 any charset allowed by the protocol or document format. If a
973 protocol or document format only allows one charset, the labels MUST
974 be given in that charset. Of course, not all charsets can properly
975 represent all labels. If a U-label cannot be displayed in its
976 entirety, the only choice (without loss of information) may be to
977 display the A-label.
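The fallback described above might look like the following in an application. This is a sketch only: display_form is a hypothetical helper, the target charset is assumed to be known, and the A-label is produced with Python's "punycode" codec plus the "xn--" prefix without any IDNA validity checks.

```python
def display_form(u_label: str, charset: str) -> str:
    """Return the U-label if the charset can represent it, else the A-label."""
    try:
        u_label.encode(charset)  # probe only; the encoded bytes are discarded
        return u_label
    except UnicodeEncodeError:
        # At least one character has no representation in this charset,
        # so fall back to the ASCII form rather than lose information.
        return "xn--" + u_label.encode("punycode").decode("ascii")

label = "b\u00fccher"
in_latin1 = display_form(label, "iso-8859-1")  # representable: U-label kept
in_ascii = display_form(label, "ascii")        # not representable: A-label
```

The sketch falls back for the whole label rather than substituting individual characters, since a partial substitution would display a different label from the one intended.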
979 In any place where a protocol or document format allows transmission
980 of the characters in internationalized labels, labels SHOULD be
981 transmitted using whatever character encoding and escape mechanism
982 the protocol or document format uses at that place. This provision
983 is intended to prevent situations in which, e.g., UTF-8 domain names
984 appear embedded in text that is otherwise in some other character
985 coding.
987 All protocols that use domain name slots already have the capacity
988 for handling domain names in the ASCII charset. Thus, A-labels can
989 inherently be handled by those protocols.
991 7.3. Linguistic Expectations: Ligatures, Digraphs, and Alternate
992 Character Forms
994 Users often have expectations about character matching or equivalence
995 that are based on their languages and the orthography of those
996 languages. These expectations may not be consistent with forms or
997 actions that can be naturally accommodated in a character coding
998 system, especially if multiple languages are written using the same
999 script but using different conventions. A Norwegian user might
1000 expect a label with the ae-ligature to be treated as the same label
1001 as one using the Swedish spelling with a-umlaut even though applying
1002 that mapping to English would be astonishing to users. A German
1003 user might expect a label with an o-umlaut and a label that had
1004 "oe" substituted, but was otherwise the same, to be treated as
1005 equivalent even though that substitution would be a clear error in
1006 Swedish. A Chinese user might expect automatic matching of
1007 Simplified and Traditional Chinese characters, but applying that
1008 matching for Korean or Japanese text would create considerable
1009 confusion. For that matter, an English user might expect "theater"
1010 and "theatre" to match.
1012 Related issues arise because there are a number of languages written
1013 with alphabetic scripts in which single phonemes are written using
1014 two characters, termed a "digraph", for example, the "ph" in
1015 "pharmacy" and "telephone". (Note that characters paired in this
1016 manner can also appear consecutively without forming a digraph, as in
1017 "tophat".) Certain digraphs are normally indicated typographically
1018 by setting the two characters closer together than they would be if
1019 used consecutively to represent different phonemes. Some digraphs
1020 are fully joined as ligatures (strictly, a ligature is set with no
1021 intervening white space at all, although the term is sometimes
1022 applied to closely set pairs). An example of this may be seen when
1023 the word "encyclopaedia" is set with U+00E6 LATIN SMALL LETTER AE
1024 (and some would not consider that word correctly spelled unless the
1025 ligature form was used or the "a" was dropped entirely). When these
1026 ligature and digraph forms have the same interpretation across all
1027 languages that use a given script, application of Unicode
1028 normalization generally resolves the differences and causes them to
1029 match. When they have different interpretations, any requirements
1030 for matching must utilize other methods or users must be educated to
1031 understand that matching will not occur.
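The difference between the two cases can be seen with Python's unicodedata module. The sketch below assumes NFKC as the normalization form; the variable names are illustrative. A purely typographic ligature is folded to its component letters, while U+00E6, which is a distinct letter in some languages, has no decomposition and is left alone.

```python
import unicodedata

def nfkc(s: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", s)

fi_ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
ae_letter = "\u00e6"     # LATIN SMALL LETTER AE

matches_fi = nfkc(fi_ligature) == "fi"   # compatibility mapping applies
matches_ae = nfkc(ae_letter) == "ae"     # no decomposition is defined
```

Any matching between U+00E6 and the two-letter sequence therefore has to come from outside normalization, for example from registry policy.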
1033 Difficulties arise from the fact that a given ligature may be a
1034 completely optional typographic convenience for representing a
1035 digraph in one language (as in the above example with some spelling
1036 conventions), while in another language it is a single character that
1037 may not always be correctly representable by a two-letter sequence
1038 (as in the above example with different spelling conventions). This
1039 can be illustrated by many words in the Norwegian language, where the
1040 "ae" ligature is the 27th letter of a 29-letter extended Latin
1041 alphabet. It is equivalent to the 28th letter of the Swedish
1042 alphabet (also containing 29 letters), U+00E4 LATIN SMALL LETTER A
1043 WITH DIAERESIS, for which an "ae" cannot be substituted according to
1044 current orthographic standards.
1046 That character (U+00E4) is also part of the German alphabet where,
1047 unlike in the Nordic languages, the two-character sequence "ae" is
1048 usually treated as a fully acceptable alternate orthography for the
1049 "umlauted a" character. The inverse is however not true, and those
1050 two characters cannot necessarily be combined into an "umlauted a".
1051 This also applies to another German character, the "umlauted o"
1052 (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) which, for example,
1053 cannot be used for writing the name of the author "Goethe". It is
1054 also a letter in the Swedish alphabet where, in parallel to the
1055 "umlauted a", it cannot be correctly represented as "oe" and in the
1056 Norwegian alphabet, where it is represented, not as "umlauted o", but
1057 as "slashed o", U+00F8.
1059 Some of the ligatures that have explicit code points in Unicode were
1060 given special handling in IDNA2003 and now pose additional problems
1061 as people argue that they should have been treated differently to
1062 preserve important information. For example, the German character
1063 Eszett (Sharp S, U+00DF) is retained as itself by NFKC but case-
1064 folded by Stringprep to "ss", but the closely-related, but less
1065 frequently seen, character "Long S T" (U+FB05) is a compatibility
1066 character that is mapped out by NFKC. Unless exceptions are made,
1067 both will be treated as DISALLOWED by IDNA2008. But there is
1068 significant interest in an exception, especially for Eszett.
1069 Depending on what the exception was, making it would either raise
1070 some backward compatibility problems with IDNA2003 or create an
1071 unusual special case that would highlight differences in preferred
1072 orthography between German as written in Germany and German as
1073 written in some other countries, notably Switzerland. Additional
1074 discussion of issues with Eszett appears in Section 10.7.
1076 Additional cases with alphabets written right to left are described
1077 in Section 7.5.
1079 Whether ligatures and digraphs are to be treated as a sequence of
1080 characters or as a single standalone one constitutes a problem that
1081 cannot be resolved solely by operating on scripts. It is,
1082 however, a key concern in the IDN context. Its satisfactory
1083 resolution will require support in policies set by registries, which
1084 therefore need to be particularly mindful not just of this specific
1085 issue, but of all other related matters that cannot be dealt with on
1086 an exclusively algorithmic basis.
1088 Just as with the examples of different-looking characters that may be
1089 assumed to be the same, it is in general impossible to deal with
1090 these situations in a system such as IDNA -- or with Unicode
1091 normalization generally -- since determining what to do requires
1092 information about the language being used, context, or both.
1093 Consequently, these specifications make no attempt to treat these
1094 combined characters in any special way. However, their existence
1095 provides a prime example of a situation in which a registry that is
1096 aware of the language context in which labels are to be registered,
1097 and where that language sometimes (or always) treats the two-
1098 character sequences as equivalent to the combined form, should give
1099 serious consideration to applying a "variant" model [RFC3743]
1100 [RFC4290] to reduce the opportunities for user confusion and fraud
1101 that would result from the related strings being registered to
1102 different parties.
1104 7.4. Case Mapping and Related Issues
1106 Traditionally in the DNS, ASCII letters have been stored with their
1107 case preserved. Matching during the query process has been case-
1108 independent, but none of the information that might be represented by
1109 choices of case has been lost. That model has been accidentally
1110 helpful because, as people have created DNS labels by catenating
1111 words (or parts of words) to form labels, case has often been used to
1112 distinguish among components and make the labels more memorable.
1114 The solution of keeping the characters separate but doing matching
1115 independent of case is not feasible with an IDNA-like model because
1116 the matching would then have to be done on the server rather than
1117 have characters mapped on the client. That situation was recognized
1118 in IDNA2003 and nothing in IDNA2008 fundamentally changes it or could
1119 do so. In IDNA2003, all upper-case characters are mapped to lower-
1120 case ones and, in general, all code points that represent alternate
1121 forms of the same character are mapped to that character (including
1122 mapping Greek final form sigma to the medial form). IDNA2008
1123 permits, at the risk of some incompatibility, slightly more
1124 flexibility in this area. That additional flexibility still does not
1125 solve the problem with final form sigma and other characters that
1126 Unicode treats as completely separate characters that match only
1127 under casemapping if at all. Many people now believe these should be
1128 handled as separate characters so information about them can be
1129 preserved in the transformations to A-labels and back. However,
1130 making a change to permit that behavior would create a situation in
1131 which the same string, valid in both protocols, would be interpreted
1132 differently by IDNA2003 and IDNA2008. In principle, that would
1133 violate one of the conditions discussed in Section 10.3.1 and hence
1134 require a prefix change. Of course, if a prefix change were made (at
1135 the costs discussed in Section 10.3.3) there would be several
1136 options, including, if desired, assigning the character to the
1137 CONTEXTUAL RULE REQUIRED category and requiring that it only be used
1138 in carefully-selected contexts.
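
As a rough illustration of why final form sigma is awkward for any case-mapping model, the following sketch uses only Python's built-in Unicode string operations. It is not IDNA code and the variable names are purely illustrative; it merely shows that the two lower-case sigmas are distinct code points that share one upper-case form, and that case folding collapses the final form into the medial one.

```python
# Illustration only: Python's built-in Unicode case operations, not IDNA code.
MEDIAL_SIGMA = "\u03c3"   # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

# The two lower-case forms are distinct code points ...
distinct = MEDIAL_SIGMA != FINAL_SIGMA

# ... but both have the same upper-case form, and case folding
# collapses the final form into the medial one.
same_upper = MEDIAL_SIGMA.upper() == FINAL_SIGMA.upper() == CAPITAL_SIGMA
folded = FINAL_SIGMA.casefold()

print(distinct, same_upper, folded == MEDIAL_SIGMA)
```

Once the final form has been folded away, nothing in the resulting string records which form the user originally typed, which is the information loss the text above describes.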
1140 7.5. Right to Left Text
1142 In order to be sure that the directionality of right to left text is
1143 unambiguous, IDNA2003 required that any label in which right to left
1144 characters appear both starts and ends with them, may not include any
1145 characters with strong left to right properties (which excludes other
1146 alphabetic characters but permits European digits), and rejects any
1147 other string that contains a right to left character. This is one of
1148 the few places where the IDNA algorithms (both old and new) are
1149 required to look at an entire label, not just at individual
1150 characters. The algorithmic model used in IDNA2003 rejects the label
1151 when the final character in a right to left string requires a
1152 combining mark in order to be correctly represented.
1154 This problem manifests itself in languages written with consonantal
1155 alphabets to which diacritical vocalic systems are applied, and in
1156 languages with orthographies derived from them where the combining
1157 marks may have different functionality. In both cases the combining
1158 marks can be essential components of the orthography. Examples of
1159 this are Yiddish, written with an extended Hebrew script, and Dhivehi
1160 (the official language of Maldives) which is written in the Thaana
1161 script (which is, in turn, derived from the Arabic script). The new
1162 rules for right to left scripts are described in [IDNA2008-Bidi].
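
The IDNA2003 whole-label test described above can be approximated with the sketch below. It is an assumption-laden simplification: the function name is hypothetical, and it relies on the bidirectional classes of whatever Unicode version the local Python installation ships rather than the Stringprep tables that IDNA2003 actually references. It is intended only to show why a label whose last character is a combining mark fails the rule.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidirectional classes

def idna2003_style_bidi_ok(label: str) -> bool:
    """Approximate the IDNA2003 whole-label bidi test described above."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True   # no right-to-left characters: the rule does not apply
    if "L" in classes:
        return False  # strong left-to-right characters may not be mixed in
    # The label must both start and end with a right-to-left character.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES

alef_bet = "\u05d0\u05d1"    # two Hebrew letters
alef_patah = "\u05d0\u05b7"  # Hebrew letter followed by a combining vowel point

print(idna2003_style_bidi_ok(alef_bet))    # accepted
print(idna2003_style_bidi_ok(alef_patah))  # rejected: ends in a combining mark
```

The second label is rejected because the trailing vowel point has the nonspacing-mark class rather than a right-to-left class, even though it is an ordinary part of the orthography.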
1164 8. IDNs and the Robustness Principle
1166 The model of IDNs described in this document can be seen as a
1167 particular instance of the "Robustness Principle" that has been so
1168 important to other aspects of Internet protocol design. This
1169 principle is often stated as "Be conservative about what you send and
1170 liberal in what you accept" (See, e.g., RFC 1123, Section 1.2.2
1172 [RFC1123]). For IDNs to work well, not only must the protocol be
1173 carefully designed and implemented, but zone administrators
1174 (registries) must have and require sensible policies about what is
1175 registered -- conservative policies -- and implement and enforce
1176 them.
1178 Conversely, resolvers can (and SHOULD or maybe MUST) reject labels
1179 that clearly violate global (protocol) rules (no one has ever
1180 seriously claimed that being liberal in what is accepted requires
1181 being stupid). However, once one gets past such global rules and
1182 deals with anything sensitive to script or locale, it is necessary to
1183 assume that garbage has not been placed into the DNS, i.e., one must
1184 be liberal about what one is willing to look up in the DNS rather
1185 than guessing about whether it should have been permitted to be
1186 registered.
1188 As mentioned elsewhere, if a string doesn't resolve, it makes no
1189 difference whether it simply wasn't registered or was prohibited by
1190 some rule.
1192 If resolvers, as a user interface (UI) or other local matter, decide
1193 to warn about some strings that are valid under the global rules but
1194 that they perceive as dangerous, that is their prerogative and we can
1195 only hope that the market (and maybe regulators) will reinforce the
1196 good choices and discourage the poor ones. In this context, a
1197 resolver that decides a string that is valid under the protocol is
1198 dangerous and refuses to look it up is in violation of the protocols;
1199 one that is willing to look something up, but warns against it, is
1200 exercising a local choice.
1202 9. Front-end and User Interface Processing
1204 Domain names may be identified and processed in many contexts. They
1205 may be typed in by users either by themselves or as part of URIs or
1206 IRIs. They may occur in running text or be processed by one system
1207 after being provided in another. Systems may wish to try to
1208 normalize URLs so as to determine (or guess) whether a reference is
1209 valid or two references point to the same object without actually
1210 looking the objects up and comparing them. Some of these goals may
1211 be more easily and reliably satisfied than others. While there are
1212 strong arguments for any domain name that is placed "on the wire" --
1213 transmitted between systems -- to be in the minimum-ambiguity forms
1214 of A-labels, U-labels, or LDH-labels, it is inevitable that programs
1215 that process domain names will encounter variant forms. One source
1216 of such forms will be labels created under IDNA2003. Because of the
1217 way that protocol was specified, there are a significant number of
1218 domain names in files on the Internet that use characters that cannot
1219 be represented directly in domain names but for which interpretations
1220 are provided. There are two major categories of such characters,
1221 those that are removed by NFKC normalization and those upper-case
1222 characters that are mapped to lower-case (there are also a few
1223 characters that are given special-case mapping treatment in
1224 Stringprep). [[anchor29: The text above is too obscure, but was
1225 intended to address the mapping differences between IDNA2003 and the
1226 current proposal. Patrik suggests the following, which will need
1227 some tuning before it can be inserted: "One source of such forms will
1228 be labels created under IDNA2003, as some allowed labels were
1229 transformed before they were turned into their ASCII (xn--) form so
1230 that ToUnicode(ToASCII(label)) != label. This is why IDNA2008
1231 explicitly defines A-label and U-label as forms of the label that
1232 are stable when converting between A-label and U-label, without
1233 mappings. A different way of explaining this is that there could be
1234 already today domain names in files on the Internet that use
1235 characters that cannot be represented directly in domain names but
1236 for which interpretations are provided. There are two major
1237 categories of such characters, those that are removed by NFKC
1238 normalization and those upper-case characters that are mapped to
1239 lower-case (there are also a few characters that are given special-
1240 case mapping treatment in Stringprep)."]]
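
The ToUnicode(ToASCII(label)) != label asymmetry mentioned in the note above can be observed with Python's standard-library "idna" codec, which implements IDNA2003 (RFC 3490) including the Nameprep mappings. The label chosen here is only an example.

```python
# Python's built-in "idna" codec implements IDNA2003 (RFC 3490), including
# the Nameprep mappings, so it can show the asymmetry directly.
label = "B\u00fccher"                # "Bücher", with an upper-case B

a_label = label.encode("idna")       # ToASCII: maps to lower case, then Punycode
round_trip = a_label.decode("idna")  # ToUnicode

print(a_label)
print(round_trip == label)           # the original form is not recovered
print(round_trip == label.lower())   # only the mapped form comes back
```

Because the upper-case letter is mapped before encoding, the string that comes back differs from the string that went in; a U-label, by contrast, is defined so that the conversion in both directions is lossless.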
1242 Other issues in domain name identification and processing arise
1243 because IDNA2003 specified that several other characters be treated
1244 as equivalent to the ASCII period (dot, full stop) character used as
1245 a label separator. If a domain name appears in an arbitrary context
1246 (such as running text), it is difficult, even with only ASCII
1247 characters, to know whether a domain name (or a protocol parameter
1248 like a URI) is present and where it starts and ends. When using
1249 Unicode this gets even more difficult if treatment of certain special
1250 characters (like the dot that separates labels in a domain name)
1251 depends on context. That problem occurs if the dot is part of a
1252 domain name or not, which would mean that, contrary to common
1253 practice today, the primary heuristic for identifying a domain name
1254 depends on dots separating strings with no intervening spaces.
1255 [[anchor30: Above text is a substitute for an earlier (pre -01)
1256 version and is hoped to be more clear. Comments and improvements
1257 welcome.]]
1259 As discussed elsewhere in this document, the IDNA2008 model removes
1260 all of these mappings and interpretations, including the equivalence
1261 of different forms of dots, from the protocol, leaving such mappings
1262 to local processing. This should not be taken to imply that local
1263 processing is optional or can be avoided entirely. Instead, unless
1264 the program context is such that it is known that any IDNs that
1265 appear will be either U-labels or A-labels, some local processing of
1266 apparent domain name strings will be required, both to maintain
1267 compatibility with IDNA2003 and to prevent user astonishment. Such
1268 local processing, while not specified in this document or the
1269 associated ones, will generally take one of two forms:
1271 o Generic Preprocessing.
1272 When the context in which the program or system that processes
1273 domain names operates is global, a reasonable balance must be
1274 found that is sensitive to the broad range of local needs and
1275 assumptions while, at the same time, not sacrificing the needs of
1276 one language, script, or user population to those of another.
1278 For this case, the best practice will usually be to apply NFKC and
1279 case-mapping (or, perhaps better yet, Stringprep itself), plus
1280 dot-mapping where appropriate, to the domain name string prior to
1281 applying IDNA. That practice will not only yield a reasonable
1282 compromise of user experience with protocol requirements but will
1283 be almost completely compatible with the various forms permitted
1284 by IDNA2003.
1286 o Highly Localized Preprocessing.
1287 Unlike the case above, there will be some situations in which
1288 software will be highly localized for a particular environment and
1289 carefully adapted to the expectations of users in that
1290 environment. The many discussions about using the Internet to
1291 preserve and support local cultures suggest that these cases may
1292 be more common in the future than they have been so far.
1294 In these cases, we should avoid trying to tell implementers what
1295 they should do, if only because they are quite likely (and for
1296 good reason) to ignore us. We would assume that they would map
1297 characters that the intuitions of their users would suggest be
1298 mapped. One can imagine switches about whether some sorts of
1299 mappings occur, warnings before applying them or, in a slightly
1300 more extreme version of the approach taken in Internet Explorer
1301 version 7 (IE7), an utter refusal to handle "strange" characters at
1302 all if they appear in U-label form. None of those local decisions
1303 are a threat to interoperability as long as (i) only U-labels and
1304 A-labels are used in interchange with systems outside the local
1305 environment, (ii) no character that would be valid in a U-label as
1306 itself is mapped to something else, (iii) any local mappings are
1307 applied as a preprocessing step (or, for conversions from U-labels
1308 or A-labels to presentation forms, postprocessing), not as part of
1309 IDNA processing proper, and (iv) appropriate consideration is
1310 given to labels that might have entered the environment in
1311 conformance to IDNA2003. [[anchor31: Placeholder: there have been
1312 suggestions that this text be removed entirely. Comments (or
1313 improved text) welcome.]]
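
The generic preprocessing described in the first item above might be sketched as follows. The helper name is hypothetical and the steps are only an approximation of Stringprep: the sketch maps the three dot-like code points that IDNA2003 treated as label separators, applies Python's case folding, and then applies NFKC. It does not reproduce the Nameprep tables exactly and performs no validity checking.

```python
import unicodedata

# Code points that IDNA2003 treated as equivalent to the ASCII full stop.
DOT_LIKE = {0x3002: ".", 0xFF0E: ".", 0xFF61: "."}

def generic_preprocess(name: str) -> str:
    """Hypothetical helper: dot-mapping, case folding, then NFKC."""
    name = name.translate(DOT_LIKE)   # dot-mapping
    name = name.casefold()            # case-mapping (approximation)
    return unicodedata.normalize("NFKC", name)

# Fullwidth "EXAMPLE", an ideographic full stop, and fullwidth "COM".
typed = "\uff25\uff38\uff21\uff2d\uff30\uff2c\uff25\u3002\uff23\uff2f\uff2d"
print(generic_preprocess(typed))
```

The result of such a step would then be handed to IDNA proper, which, under the IDNA2008 model, performs no further mapping of its own.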
1315 10. Migration and Version Synchronization
1317 10.1. Design Criteria
1319 As mentioned above and in RFC 4690, two key goals of this work are to
1320 enable applications to be agnostic about whether they are being run
1321 in environments supporting any Unicode version from 3.2 onward and to
1322 permit incrementally adding permitted scripts and other character
1323 collections without disruption or, subsequent to this version,
1324 "heavy" processes such as formation of an IETF WG. The mechanisms
1325 that support this are outlined above, but this section reviews them
1326 in a context that may be more helpful to those who need to understand
1327 the approach and make plans for it.
1329 10.1.1. General IDNA Validity Criteria
1331 The general criteria for a putative label, and the collection of
1332 characters that make it up, to be considered IDNA-valid are:
1334 o The characters are "letters", marks needed to form letters,
1335 numerals, or other code points used to write words in some
1336 language. Symbols, drawing characters, and various notational
1337 characters are permanently excluded -- some because they are
1338 actively dangerous in URI, IRI, or similar contexts and others
1339 because there is no evidence that they are important enough to
1340 Internet operations or internationalization to justify inclusion
1341 and the complexities that would come with it (additional
1342 discussion and rationale for the symbol decision appears in
1343 Section 10.5).
1345 o Other than in very exceptional cases, e.g., where they are needed
1346 to write substantially any word of a given language, punctuation
1347 characters are excluded as well. The fact that a word exists is
1348 not proof that it should be usable in a DNS label and DNS labels
1349 are not expected to be usable for multiple-word phrases (although
1350 they are certainly not prohibited if the conventions and
1351 orthography of a particular language cause that to be possible).
1352 Even for English, very common constructions -- contractions like
1353 "don't" or "it's", names that are written with apostrophes such as
1354 "O'Reilly" or characters for which apostrophes are common
1355 substitutes, and words whose usually-preferred spellings retain
1356 diacritical marks from earlier forms -- cannot be represented in
1357 DNS labels.
1359 o Characters that are unassigned (have no character assignment at
1360 all) in the version of Unicode being used by the registry or
1361 application are not permitted, even on resolution (lookup). There
1362 are at least two reasons for this. First, tests involving the context of
1363 characters (e.g., some characters being permitted only adjacent to
1364 ones of specific types but otherwise invisible or very problematic
1365 for other reasons) and integrity tests on complete labels are
1366 needed. Unassigned code points cannot be permitted because one
1367 cannot determine whether particular code points will require
1368 contextual rules (and what those rules should be) before
1369 characters are assigned to them and the properties of those
1370 characters fully understood. Second, Unicode specifies that an
1371 unassigned code point normalizes and case folds to itself. If the
1372 code point is later assigned to a character, and particularly if
1373 the newly-assigned code point has a combining class that
1374 determines its placement relative to other combining characters,
1375 it could normalize to some other code point or sequence, creating
1376 confusion and/or violating other rules listed here.
1378 o Any character that is mapped to another character by Nameprep2003
1379 or by a current version of NFKC is prohibited as input to IDNA
1380 (for either registration or resolution). Implementers of user
1381 interfaces to applications are free to make those conversions when
1382 they consider them suitable for their operating system
1383 environments, context, or users.
1385 Tables used to identify the characters that are IDNA-valid are
1386 expected to be driven by the principles above (described in more
1387 precise form in [IDNA2008-Tables]). The principles are not just an
1388 interpretation of the tables.
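
As a minimal sketch of the last criterion in the list above, the following hypothetical helper reports the characters of a candidate label that NFKC or case folding would change. It uses Python's built-in case folding as a stand-in for the Nameprep mapping tables and tests each character in isolation, so it will miss sequences that NFKC would only alter in combination; the authoritative rules are those in [IDNA2008-Tables].

```python
import unicodedata

def mapped_code_points(label: str) -> list[str]:
    """Return the characters in label that NFKC or case folding would change."""
    return [
        ch for ch in label
        if unicodedata.normalize("NFKC", ch) != ch or ch.casefold() != ch
    ]

print(mapped_code_points("abc"))        # nothing would be mapped
print(mapped_code_points("Abc"))        # the upper-case letter would be
print(mapped_code_points("\ufb01le"))   # so would the "fi" ligature
```

A user interface remains free to apply exactly these mappings before invoking IDNA; the point of the criterion is only that the mapped-from characters are not themselves valid input.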
1390 10.1.2. Labels in Registration
1392 Anyone entering a label into a DNS zone must properly validate that
1393 label -- i.e., be sure that the criteria for that label are met -- in
1394 order for applications to work as intended. This principle is not
1395 new: for example, zone administrators are expected to verify that
1396 names meet "hostname" [RFC0952] or special service location formats
1397 [RFC2782] where necessary for the expected applications. For zones
1398 that will contain IDNs, support for Unicode version-independence
1399 requires restrictions on all strings placed in the zone. In
1400 particular, for such zones:
1402 o Any label that appears to be an A-label, i.e., any label that
1403 starts with "xn--", MUST be IDNA-valid, i.e., it MUST be a
1404 valid A-label, as discussed in Section 3 above.
1406 o The Unicode tables (i.e., tables of code points, character
1407 classes, and properties) and IDNA tables (i.e., tables of
1408 contextual rules such as those described above), MUST be
1409 consistent on the systems performing or validating labels to be
1410 registered. Note that this does not require that tables reflect
1411 the latest version of Unicode, only that all tables used on a
1412 given system are consistent with each other.
1414 [[anchor33: Note in draft: the above text was changed significantly
1415 between -00 and -01 to clearly restrict its scope to zones supporting
1416 IDNA and to eliminate comments about labels containing "--" in the
1417 third and fourth positions but with different prefixes. There appears
1418 to be consensus that more extensive rules belong in a "best
1419 practices" document about appropriate DNS labels, but that document
1420 is not in-scope for the IDNABIS WG.]]
1422 Under this model, a registry (or entity communicating with a registry
1423 to accomplish name registrations) will need to update its tables --
1424 both the Unicode-associated tables and the tables of permitted IDN
1425 characters -- to enable a new script or other set of new characters.
1426 It will not be affected by newer versions of Unicode, or newly-
1427 authorized characters, until and unless it wishes to make those
1428 registrations. The registration side is also responsible -- under the
1429 protocol and to registrants and users -- for much more careful
1430 checking than is expected of applications systems that look names up,
1431 both checking as required by the protocol and checking required by
1432 whatever policies it develops for minimizing risks due to confusable
1433 characters and sequences and preserving language or script integrity.
1435 Systems looking up or resolving DNS labels, especially IDN DNS
1436 labels, MUST be able to assume that applicable registration rules
1437 were followed for names entered into the DNS.
1439 10.1.3. Labels in Resolution (Lookup)
1441 Anyone looking up a label in a DNS zone
1443 o MUST maintain a consistent set of tables, as discussed above. As
1444 with registration, the tables need not reflect the latest version
1445 of Unicode but they MUST be consistent.
1447 o MUST validate the characters in labels to be looked up only to the
1448 extent of determining that the U-label does not contain either
1449 code points prohibited by IDNA (categorized as "DISALLOWED") or
1450 code points that are unassigned in its version of Unicode.
1452 o MUST validate the label itself for conformance with a small number
1453 of whole-label rules, notably verifying that there are no leading
1454 combining marks, that the "bidi" conditions are met if right to
1455 left characters appear, that any required contextual rules are
1456 available and that, if such rules are associated with Joiner
1457 Controls, they are tested.
1459 o MUST NOT validate other contextual rules about characters,
1460 including mixed-script label prohibitions, although such rules MAY
1461 be used to influence presentation decisions in the user interface.
1463 By avoiding applying its own interpretation of which labels are valid
1464 as a means of rejecting lookup attempts, the resolver application
1465 becomes less sensitive to version incompatibilities with the
1466 particular zone registry associated with the domain name.
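
Two of the lookup-time checks listed above, the test for unassigned code points and the test for a leading combining mark, can be sketched as below. The function name and message strings are hypothetical, and the sketch deliberately omits the DISALLOWED category, the bidi conditions, and contextual rules, all of which depend on tables defined in the associated documents.

```python
import unicodedata

def lookup_precheck(u_label: str) -> list[str]:
    """Hypothetical partial pre-lookup check; returns a list of problems."""
    problems = []
    for ch in u_label:
        # General category "Cn" means "not assigned" in this Unicode version.
        if unicodedata.category(ch) == "Cn":
            problems.append(f"unassigned code point U+{ord(ch):04X}")
    # General categories starting with "M" are combining marks.
    if u_label and unicodedata.category(u_label[0]).startswith("M"):
        problems.append("leading combining mark")
    return problems

print(lookup_precheck("abc"))
print(lookup_precheck("\u0301abc"))   # starts with COMBINING ACUTE ACCENT
```

Returning the reason for a failure, rather than a bare yes or no, makes it straightforward to give users the distinct "unallocated code point" message recommended in this section.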
1468 An application or client that looks names up in the DNS will be able
1469 to resolve any name that is validly registered, as long as its
1470 version of the Unicode-associated tables is sufficiently up-to-date
1471 to interpret all of the characters in the label. It SHOULD
1472 distinguish, in its messages to users, between "label contains an
1473 unallocated code point" and other types of lookup failures. A
1474 failure on the basis of an old version of Unicode may lead the user
1475 to a desire to upgrade to a newer version, but will have no other ill
1476 effects (this is consistent with behavior in the transition to the
1477 DNS when some hosts could not yet handle some forms of names or
1478 record types).
1480 10.2. More Flexibility in User Agents
1482 These specifications do not perform mappings between one character or
1483 code point and others for any reason. Instead, they prohibit the
1484 characters that would be mapped to others by normalization, case
1485 folding, or other rules. As examples, while mathematical characters
1486 based on Latin ones are accepted as input to IDNA2003, they are
1487 prohibited in IDNA2008. Similarly, double-width characters and other
1488 variations are prohibited as IDNA input.
1490 Since the rules in [IDNA2008-Tables] provide that only strings that
1491 are stable under NFKC are valid, if it is convenient for an
1492 application to perform NFKC normalization before lookup, that
1493 operation is safe since this will never make the application unable
1494 to look up any valid string.
1496 In many cases these prohibitions should have no effect on what the
1497 user can type at resolution time. It is perfectly reasonable for
1498 systems that support user interfaces to perform some character
1499 mapping that is appropriate to the local environment. This would
1500 normally be done prior to actual invocation of IDNA. At least
1501 conceptually, the mapping would be part of the Unicode conversions
1502 discussed above and in [IDNA2008-Protocol]. However, those changes
1503 will be local ones only -- local to environments in which users will
1504 clearly understand that the character forms are equivalent. For use
1505 in interchange among systems, it appears to be much more important
1506 that U-labels and A-labels can be mapped back and forth without loss
1507 of information.
1509 One specific, and very important, instance of this strategy arises
1510 with case-folding. In the ASCII-only DNS, names are looked up and
1511 matched in a case-independent way, but no actual case-folding occurs.
1512 Names can be placed in the DNS in either upper or lower case form (or
1513 any mixture of them) and that form is preserved, returned in queries,
1514 and so on. IDNA2003 simulated that behavior by performing case-
1515 mapping at registration time (resulting in only lower-case IDNs in
1516 the DNS) and when names were looked up.
1518 As suggested earlier in this section, it appears to be desirable to
1519 do as little character mapping as possible consistent with having
1520 Unicode work correctly (e.g., NFC mapping to resolve different
1521 codings for the same character is still necessary although the
1522 specifications require that it be performed prior to invoking the
1523 protocol) and to make the mapping between A-labels and U-labels
1524 idempotent. Case-mapping is not an exception to this principle. If
1525 only lower case characters can be registered in the DNS (i.e., be
1526 present in a U-label), then IDNA2008 should prohibit upper-case
1527 characters as input. Some other considerations reinforce this
1528 conclusion. For example, an essential element of the ASCII case-
1529 mapping functions is that uppercase(character) must be equal to
1530 uppercase(lowercase(character)). That requirement may not be
1531 satisfied with IDNs. The relationship between upper case and lower
1532 case may even be language-dependent, with different languages (or
1533 even the same language in different areas) expecting different
1534 mappings. Of course, the expectations of users who are accustomed to
1535 a case-insensitive DNS environment will probably be well-served if
1536 user agents perform case mapping prior to IDNA processing, but the
1537 IDNA procedures themselves should neither require such mapping nor
1538 expect it when it is not natural to the localized environment.
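
The observation that the ASCII case-mapping identities need not hold outside ASCII can be checked with Python's default (locale-independent) Unicode case operations. The two characters below are chosen only as examples; locale-specific tailoring, such as Turkish rules, is not applied by these built-in methods and would change the results again.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# uppercase(c) versus uppercase(lowercase(c)) for the dotted capital I.
direct = DOTTED_CAPITAL_I.upper()
via_lower = DOTTED_CAPITAL_I.lower().upper()
print(direct == via_lower)

# lowercase(uppercase(c)) versus c for the dotless small i.
print(DOTLESS_SMALL_I.upper().lower() == DOTLESS_SMALL_I)
```

Both comparisons are expected to print False: the round trip through the other case yields a different code point sequence from the direct mapping.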
1540 10.3. The Question of Prefix Changes
1542 The conditions that would require a change in the IDNA "prefix"
1543 ("xn--" for the version of IDNA specified in [RFC3490]) have been a
1544 great concern to the community. A prefix change would clearly be
1545 necessary if the algorithms were modified in a manner that would
1546 create serious ambiguities during subsequent transition in
1547 registrations. This section summarizes our conclusions about the
1548 conditions under which changes in prefix would be necessary and the
1549 implications of such a change.
1551 10.3.1. Conditions Requiring a Prefix Change
1553 An IDN prefix change is needed if a given string would resolve or
1554 otherwise be interpreted differently depending on the version of the
1555 protocol or tables being used. Consequently, work to update IDNs
1556 would require a prefix change if, and only if, one of the following
1557 four conditions were met:
1559 1. The conversion of an A-label to Unicode (i.e., a U-label) yields
1560 one string under IDNA2003 (RFC3490) and a different string under
1561 IDNA2008.
1563 2. An input string that is valid under IDNA2003 and also valid under
1564 IDNA2008 yields two different A-labels with the different
1565 versions of IDNA. This condition is believed to be essentially
1566 equivalent to the one above.
1568 Note, however, that if the input string is valid under one
1569 version and not valid under the other, this condition does not
1570 apply. See the first item in Section 10.3.2, below.
1572 3. A fundamental change is made to the semantics of the string that
1573 is inserted in the DNS, e.g., if a decision were made to try to
1574 include language or specific script information in that string,
1575 rather than having it be just a string of characters.
1577 4. A sufficiently large number of characters is added to Unicode so
1578 that the Punycode mechanism for block offsets no longer has
1579 enough capacity to reference the higher-numbered planes and
1580 blocks. This condition is unlikely even in the long term and
1581 certain not to arise in the next few years.
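
The second condition above can be made concrete with the final form sigma case from Section 7.4. In the sketch below, the standard-library "idna" codec supplies IDNA2003 behavior, while plain Punycode encoding of the unmodified label serves as an assumed stand-in for a protocol that performs no mapping. It does not represent a decision about how IDNA2008 treats this character.

```python
label = "\u03b2\u03cc\u03bb\u03bf\u03c2"   # Greek word ending in final sigma

# IDNA2003 behavior via the standard-library codec (maps final sigma away).
mapped = label.encode("idna")

# Stand-in for a no-mapping protocol: Punycode-encode the label unchanged.
unmapped = b"xn--" + label.encode("punycode")

print(mapped, unmapped, mapped != unmapped)
```

If both results were accepted as valid A-labels under the same prefix, one input string would correspond to two different names in the DNS, which is exactly the ambiguity that would force a prefix change.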
1583 10.3.2. Conditions Not Requiring a Prefix Change
1585 In particular, as a result of the principles described above, none of
1586 the following changes require a new prefix:
1588 1. Prohibition of some characters as input to IDNA. This may make
1589 names that are now registered inaccessible, but does not require
1590 a prefix change.
1592 2. Adjustments in Stringprep tables or IDNA actions, including
1593 normalization definitions, that affect characters that were
1594 already invalid under IDNA2003.
1596 3. Changes in the style of definitions of Stringprep or Nameprep
1597 that do not alter the actions performed by them.
1599 Of course, because these specifications do not involve changes to
1600 Stringprep or Nameprep, the third condition above and part of the
1601 second are moot.
1603 10.3.3. Implications of Prefix Changes
1605 While it might be possible to make a prefix change, the costs of such
1606 a change are considerable. Even if they wanted to do so, all
1607 registries could not convert all IDNA2003 ("xn--") registrations to a
1608 new form at the same time and synchronize that change with
1609 applications supporting lookup. Unless all existing registrations
1610 were simply to be declared invalid, and perhaps even then, systems
1611 that needed to support both labels with old prefixes and labels with
1612 new ones would first process a putative label under the IDNA2008
1613 rules and try to look it up and then, if it were not found, would
1614 process the label under IDNA2003 rules and look it up again. That
1615 process could significantly slow down all processing that involved
1616 IDNs in the DNS especially since, in principle, a fully-qualified
1617 name could contain a mixture of labels that were registered with the
1618 old and new prefixes, a situation that would make the use of DNS
1619 caching very difficult. In addition, looking up the same input
1620 string as two separate A-labels would create some potential for
1621 confusion and attacks, since they could, in principle, resolve to
1622 different targets.
1624 Consequently, a prefix change is to be avoided if at all possible,
1625 even if it means accepting some IDNA2003 decisions about character
1626 distinctions as irreversible.
1628 10.4. Stringprep Changes and Compatibility
1630 Concerns have been expressed about problems for non-DNS uses of
1631 Stringprep being caused by changes to the specification intended to
1632 improve the handling of IDNs, most notably as this might affect
1633 identification and authentication protocols. Section 10.3, above,
1634 essentially also applies in this context. The proposed new inclusion
1635 tables [IDNA2008-Tables], the reduction in the number of characters
1636 permitted as input for registration or resolution (Section 6), and
1637 even the proposed changes in handling of right to left strings
1638 [IDNA2008-Bidi] either give interpretations to strings prohibited
1639 under IDNA2003 or prohibit strings that IDNA2003 permitted. Strings
1640 that are valid under both IDNA2003 and IDNA2008, and the
1641 corresponding versions of Stringprep, are not changed in
1642 interpretation. This protocol does not use either Nameprep or
1643 Stringprep as specified in IDNA2003.
1645 It is particularly important to keep IDNA processing separate from
1646 processing for various security protocols because some of the
1647 constraints that are necessary for smooth and comprehensible use of
1648 IDNs may be unwanted or undesirable in other contexts. For example,
1649 the criteria for good passwords or passphrases are very different
1650 from those for desirable IDNs. Similarly, internationalized SCSI
1651 identifiers and other protocol components are likely to have
1652 different requirements than IDNs.
1654 Perhaps even more important in practice, since most other known uses
1655 of Stringprep encode or process characters that are already in
1656 normalized form and expect the use of only those characters that can
1657 be used in writing words of languages, the changes proposed here and
1658 in [IDNA2008-Tables] are unlikely to have any effect at all,
1659 especially not on registries and registrations that follow rules
1660 already in existence when this work started.
1662 10.5. The Symbol Question
1664 One of the major differences between this specification and the
1665 original version of IDNA is that the original version permitted non-
1666 letter symbols of various sorts, including punctuation and line-
1667 drawing symbols, in the protocol. They were always discouraged in
1668 practice. In particular, both the "IESG Statement" about IDNA and
1669 all versions of the ICANN Guidelines specify that only language
1670 characters be used in labels. This specification disallows symbols
1671 entirely. There are several reasons for this, which include:
1673 o As discussed elsewhere, the original IDNA specification assumed
1674 that as many Unicode characters as possible should be permitted,
1675 directly or via mapping to other characters, in IDNs. This
1676 specification operates on an inclusion model, extrapolating from
1677 the LDH rules --which have served the Internet very well-- to a
1678 Unicode base rather than an ASCII base.
1680 o Unicode names for letters are, in most cases, fairly
1681 intuitive, unambiguous and recognizable to users of the relevant
1682 script. Symbol names are more problematic because there may be no
1683 general agreement on whether a particular glyph matches a symbol;
1684 there are no uniform conventions for naming; variations such as
1685 outline, solid, and shaded forms may or may not exist; and so on.
1686 As just one example, consider a "heart" symbol as it might appear
1687 in a logo that might be read as "I love...". While the user might
1688 read such a logo as "I love..." or "I heart...", considerable
1689 knowledge of the coding distinctions made in Unicode is needed to
1690 know that there is more than one "heart" character (e.g., U+2665,
1691 U+2661, and U+2765) and how to describe it. These issues are of
1692 particular importance if strings are expected to be understood or
1693 transcribed by the listener after being read out loud.
1694 [[anchor35: The above paragraph remains controversial as to
1695 whether it is valid. The WG will need to make a decision if this
1696 section is not dropped entirely.]]
1698 o As a simplified example of this, assume one wanted to use a
1699 "heart" or "star" symbol in a label. This is problematic because
1700 those names are ambiguous in the Unicode system of naming (the
1701 actual Unicode names require far more qualification). A user or
1702 would-be registrant has no way to know --absent careful study of
1703 the code tables-- whether it is ambiguous (e.g., where there are
1704 multiple "heart" characters) or not. Conversely, the user seeing
1705 the hypothetical label doesn't know whether to read it --try to
1706 transmit it to a colleague by voice-- as "heart", as "love", as
1707 "black heart", or as any of the other examples below.
1709 o The actual situation is even worse than this. There is no
1710 possible way for a normal, casual, user to tell the difference
1711 between the hearts of U+2665 and U+2765, or between the stars of
1712 U+2606 and U+2729, without somehow knowing to look for a
1713 distinction. We have a white heart (U+2661) and a few black hearts,
1714 and describing a label containing a heart symbol is hopelessly
1715 ambiguous. In cities where "Square" is a popular part of a
1716 location name, one might well want to use a square symbol in a
1717 label as well and there are far more squares of various flavors in
1718 Unicode than there are hearts or stars.
1720 o The consequence of these ambiguities of description and
1721 dependencies on distinctions that were, or were not, made in
1722 Unicode codings, is that symbols are a very poor basis for
1723 reliable communication. Of course, these difficulties with
1724 symbols do not arise with actual pictographic languages and
1725 scripts which would be treated like any other language characters;
1726 the two should not be confused.
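The naming ambiguity discussed in the bullets above can be observed directly from the Unicode character names. The following is a minimal sketch in Python using only the standard unicodedata module. The helper name symbols_named and the scanned range (U+2000 through U+2BFF) are illustrative assumptions, and the names reported come from whatever Unicode version the Python runtime ships with, which is not necessarily Unicode 5.1.

```python
import unicodedata

def symbols_named(word, first=0x2000, last=0x2BFF):
    """Return {code point: name} for symbol characters in the given
    range whose Unicode name contains `word` as a whole word.

    Hypothetical helper for illustration; the range is an assumption
    chosen to cover the symbol blocks mentioned in the text.
    """
    found = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unnamed/unassigned
        # General_Category values starting with "S" are symbols.
        if word in name.split() and unicodedata.category(ch).startswith("S"):
            found[cp] = name
    return found

hearts = symbols_named("HEART")
stars = symbols_named("STAR")

for cp, name in sorted(hearts.items()):
    print(f"U+{cp:04X}  {name}")
```

A user who says "heart" aloud could mean any entry in the resulting table, which is the difficulty the text describes.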
1728 [[anchor36: Note in Draft: Should the above section be significantly
1729 trimmed or eliminated?]]
1731 10.6. Migration Between Unicode Versions: Unassigned Code Points
1733 In IDNA2003, labels containing unassigned code points are resolved on
1734 the theory that, if they appear in labels and can be resolved, the
1735 relevant standards must have changed and the registry has properly
1736 allocated only assigned values.
1738 In this specification, strings containing unassigned code points MUST
1739 NOT be either looked up or registered. There are several reasons for
1740 this, with the most important ones being:
1742 o It cannot be known with sufficient reliability in advance that a
1743 code point that was not previously assigned will not be assigned
1744 to a compatibility character. In IDNA2003, since there is no
1745 direct dependency on NFKC (Stringprep's tables are based on NFKC,
1746 but IDNA2003 depends only on Stringprep), allocation of a
1747 compatibility character might produce some odd situations, but it
1748 would not be a problem. In IDNA2008, where compatibility
1749 characters are generally assigned to DISALLOWED, permitting
1750 strings containing unassigned characters to be looked up would
1751 permit violating the principle that characters in DISALLOWED are
1752 not looked up.
1754 o More generally, the status of an unassigned character with regard
1755 to the DISALLOWED and PROTOCOL-VALID categories, and whether
1756 contextual rules are required with the latter, cannot be evaluated
1757 until a character is actually assigned and known.
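As a rough illustration of the rule that strings containing unassigned code points are neither looked up nor registered, the following Python sketch rejects any label containing a code point whose General_Category is Cn. The function names are hypothetical. The check reflects the Unicode version bundled with the Python runtime rather than Unicode 5.1, and Cn also covers noncharacters, so this is an approximation of the UNASSIGNED category and not the normative definition in [IDNA2008-Tables].

```python
import unicodedata

def unassigned_code_points(label):
    """Return the code points in `label` that the local Unicode data
    reports as General_Category Cn (unassigned or noncharacter)."""
    return [ord(ch) for ch in label if unicodedata.category(ch) == "Cn"]

def may_look_up(label):
    """Hypothetical pre-lookup test: refuse any label that contains
    a code point the local Unicode data does not know."""
    return not unassigned_code_points(label)

# U+0378 is an unassigned gap in the Greek block.
print(may_look_up("example"))         # all code points are assigned
print(may_look_up("ex\u0378mple"))    # contains an unassigned code point
```

An application built against an older Unicode version will reject labels that a newer one accepts, which is the intended conservative behavior.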
1759 It is possible to argue that the issues above are not important and
1760 that, as a consequence, it is better to retain the principle of
1761 looking up labels even if they contain unassigned characters because
1762 all of the important scripts and characters have been coded as of
1763 Unicode 5.1 and hence unassigned code points will be assigned only to
1764 obscure characters or archaic scripts. Unfortunately, that does not
1765 appear to be a safe assumption for at least two reasons. First, much
1766 the same claim of completeness has been made for earlier versions of
1767 Unicode. The reality is that a script that is obscure to much of the
1768 world may still be very important to those who use it. Cultural and
1769 linguistic preservation principles make it inappropriate to declare
1770 the script of no importance in IDNs. Second, we already have
1771 counterexamples in, e.g., the relationships associated with new Han
1772 characters being added (whether in the BMP or in Unicode Plane 2).
1774 10.7. Other Compatibility Issues
1776 The existing (2003) IDNA model includes several odd artifacts of the
1777 context in which it was developed. Many, if not all, of these are
1778 potential avenues for exploits, especially if the registration
1779 process permits "source" names (names that have not been processed
1780 through IDNA and nameprep) to be registered. As one example, since
1781 the character Eszett, used in German, is mapped by IDNA2003 into the
1782 sequence "ss" rather than being retained as itself or prohibited, a
1783 string containing that character but that is otherwise in ASCII is
1784 not really an IDN (in the U-label sense defined above) at all. After
1785 Nameprep maps the Eszett out, the result is an ASCII string and so
1786 does not get an xn-- prefix, but the string that can be displayed to
1787 a user appears to be an IDN. The proposed IDNA2008 eliminates this
1788 artifact. A character is either permitted as itself or it is
1789 prohibited; special cases that make sense only in a particular
1790 linguistic or cultural context can be dealt with as localization
1791 matters where appropriate.
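The Eszett artifact can be reproduced with Python's built-in "idna" codec, which implements the IDNA2003 rules of RFC 3490 and RFC 3491. This is a sketch for illustration only: it shows IDNA2003 behavior in one particular implementation and says nothing about how an IDNA2008 implementation is structured.

```python
from encodings.idna import nameprep

label = "stra\u00dfe"            # "strasse" written with Eszett (U+00DF)

# Nameprep (IDNA2003) maps the Eszett to "ss".
prepared = nameprep(label)

# The full IDNA2003 conversion therefore yields a plain ASCII label
# with no ACE prefix, even though the displayed form looks like an IDN.
ace = label.encode("idna")

# For contrast, a label whose non-ASCII character survives Nameprep
# does receive the "xn--" prefix.
ace_umlaut = "b\u00fccher".encode("idna")

print(prepared, ace, ace_umlaut)
```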
1793 11. Acknowledgments
1795 The editor and contributors would like to express their thanks to
1796 those who contributed significant early (pre-WG) review comments,
1797 sometimes accompanied by text, especially Mark Davis, Paul Hoffman,
1798 Simon Josefsson, and Sam Weiler. In addition, some specific ideas
1799 were incorporated from suggestions, text, or comments about sections
1800 that were unclear supplied by Frank Ellerman, Michael Everson, Asmus
1801 Freytag, Erik van der Poel, Michel Suignard, and Ken Whistler,
1802 although, as usual, they bear little or no responsibility for the
1803 conclusions the editor and contributors reached after receiving their
1804 suggestions. Thanks are also due to Vint Cerf, Debbie Garside, and
1805 Jefsey Morfin for conversations that led to considerable
1806 improvements in the content of this document.
1808 A meeting was held on 30 January 2008 to attempt to reconcile
1809 differences in perspective and terminology about this set of
1810 specifications between the design team and members of the Unicode
1811 Technical Consortium. The discussions at and subsequent to that
1812 meeting were very helpful in focusing the issues and in refining the
1813 specifications. The active participants at that meeting were (in
1814 alphabetic order as usual) Harald Alvestrand, Vint Cerf, Tina Dam,
1815 Mark Davis, Lisa Dusseault, Patrik Faltstrom (by telephone), Cary
1816 Karp, John Klensin, Warren Kumari, Lisa Moore, Erik van der Poel,
1817 Michel Suignard, and Ken Whistler. We express our thanks to Google
1818 for support of that meeting and to the participants for their
1819 contributions.
1821 Special thanks are due to Paul Hoffman for permission to extract
1822 material from his Internet-Draft to form the basis for Section 2.
1824 Useful comments and text on the WG versions of the draft were
1825 received from many participants in the IETF "IDNABIS" WG and a number
1826 of document changes resulted from mailing list discussions within
1827 that group.
1829 12. Contributors
1831 While the listed editor held the pen, the core of this document and
1832 the initial WG version represent the joint work and conclusions of
1833 an ad hoc design team consisting of the editor and, in alphabetic
1834 order, Harald Alvestrand, Tina Dam, Patrik Faltstrom, and Cary Karp.
1835 In addition, there were many specific contributions and helpful
1836 comments from those listed in the Acknowledgments section and others
1837 who have contributed to the development and use of the IDNA
1838 protocols.
1840 13. IANA Considerations
1842 This section gives an overview of registries required for IDNA. The
1843 actual definition of the first one appears in [IDNA2008-Tables].
1845 13.1. IDNA Character Registry
1847 The distinction among the three major categories "UNASSIGNED",
1848 "DISALLOWED", and "PROTOCOL-VALID" is made by special categories and
1849 rules that are integral elements of [IDNA2008-Tables]. Convenience
1850 in programming and validation requires a registry of characters and
1851 scripts and their categories, updated for each new version of Unicode
1852 and the characters it contains. The details of this registry are
1853 specified in [IDNA2008-Tables].
1855 13.2. IDNA Context Registry
1857 For characters that are defined in the IDNA Character Registry list
1858 as PROTOCOL-VALID but requiring a contextual rule (i.e., the types of
1859 rule described in Section 6.1.1.1), IANA will create and maintain a
1860 list of approved contextual rules. Additions or changes to these
1861 rules require IETF Review, as described in [RFC5226].
1862 [[anchor41: Note in Draft: This section was changed between -00 and
1863 -01 based on list discussion. Consensus needs to be verified for
1864 that decision.]]
1866 A table from which that registry can be initialized, and some further
1867 discussion, appears in [RulesInit].
1868 [[anchor42: This subsection should probably be moved to Tables along
1869 with the Contextual rules themselves (from Protocol) when the move is
1870 made.]]
1872 13.3. IANA Repository of IDN Practices of TLDs
1874 This registry, historically described as the "IANA Language Character
1875 Set Registry" or "IANA Script Registry" (both somewhat misleading
1876 terms) is maintained by IANA at the request of ICANN. It is used to
1877 provide a central documentation repository of the IDN policies used
1878 by top level domain (TLD) registries who volunteer to contribute to
1879 it and is used in conjunction with ICANN Guidelines for IDN use.
1881 It is not an IETF-managed registry and, while the protocol changes
1882 specified here may call for some revisions to the tables, these
1883 specifications have no direct effect on that registry and no IANA
1884 action is required as a result.
1886 14. Security Considerations
1888 Security on the Internet partly relies on the DNS. Thus, any change
1889 to the characteristics of the DNS can change the security of much of
1890 the Internet.
1892 Domain names are used by users to identify and connect to Internet
1893 servers. The security of the Internet is compromised if a user
1894 entering a single internationalized name is connected to different
1895 servers based on different interpretations of the internationalized
1896 domain name.
1898 When systems use local character sets other than ASCII and Unicode,
1899 this specification leaves the problem of transcoding between the
1900 local character set and Unicode up to the application or local
1901 system. If different applications (or different versions of one
1902 application) implement different transcoding rules, they could
1903 interpret the same name differently and contact different servers.
1904 This problem is not solved by security protocols like TLS that do not
1905 take local character sets into account.
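The transcoding problem can be made concrete with a short Python sketch. The same three bytes are decoded under two different assumptions about the local character set and then converted to an ACE label. The function name is hypothetical, and the built-in "idna" codec used here implements IDNA2003; it serves only to show that the resulting labels differ.

```python
def label_from_local_bytes(raw, charset):
    """Decode `raw` using the assumed local character set, then
    convert the result to an ACE label (IDNA2003 codec, illustration
    only)."""
    return raw.decode(charset).encode("idna")

raw = b"b\xe4r"  # what the user's system actually stored

# One application assumes ISO 8859-1: 0xE4 is LATIN SMALL LETTER A
# WITH DIAERESIS.
as_latin1 = raw.decode("latin-1")

# Another assumes ISO 8859-5: 0xE4 is CYRILLIC SMALL LETTER EF.
as_cyrillic = raw.decode("iso8859_5")

label_a = label_from_local_bytes(raw, "latin-1")
label_b = label_from_local_bytes(raw, "iso8859_5")

print(as_latin1, label_a)
print(as_cyrillic, label_b)
```

The two applications would query the DNS for different names, and nothing in the DNS or in TLS can detect that they started from the same input.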
1907 To help prevent confusion between characters that are visually
1908 similar, it is suggested that implementations provide visual
1909 indications where a domain name contains multiple scripts. Such
1910 mechanisms can also be used to show when a name contains a mixture of
1911 simplified and traditional Chinese characters, or to distinguish zero
1912 and one from O and l. DNS zone administrators may impose
1913 restrictions (subject to the limitations identified elsewhere in this
1914 document) that try to minimize characters that have similar
1915 appearance or similar interpretations. It is worth noting that there
1916 are no comprehensive technical solutions to the problems of
1917 confusable characters. One can reduce the extent of the problems in
1918 various ways, but probably never eliminate it. Some specific
1919 suggestions about identification and handling of confusable
1920 characters appear in a Unicode Consortium publication
1921 [Unicode-UTR36].
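A visual indication for multi-script names first requires detecting them. The sketch below is a deliberately crude heuristic: Python's standard library does not expose the Unicode Script property, so the first word of each letter's character name (for example "LATIN" or "CYRILLIC") is used as a stand-in. The function names are hypothetical, and a production implementation would presumably use real Script data and the guidance in [Unicode-UTR36] instead.

```python
import unicodedata

def script_hints(label):
    """Return the set of first words of the Unicode names of the
    letters in `label`, as a rough approximation of their scripts."""
    hints = set()
    for ch in label:
        # Only letters are considered; digits and hyphens are shared
        # across scripts and would produce false positives.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split()[0])
    return hints

def is_mixed_script(label):
    """True when the heuristic sees letters from more than one script."""
    return len(script_hints(label)) > 1

plain = "paypal"
spoof = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A

print(script_hints(plain), is_mixed_script(plain))
print(script_hints(spoof), is_mixed_script(spoof))
```

As the text notes, such checks reduce the problem without eliminating it: confusable characters within a single script are not caught at all.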
1923 The registration and resolution models described above and in
1924 [IDNA2008-Protocol] change the mechanisms available for applications
1925 and resolvers to determine the validity of labels they encounter. In
1926 some respects, the ability to test is strengthened. For example,
1927 putative labels that contain unassigned code points will now be
1928 rejected, while IDNA2003 permitted them (something that is now
1929 recognized as a considerable source of risk). On the other hand, the
1930 protocol specification no longer assumes that the application that
1931 looks up a name will be able to determine, and apply, information
1932 about the protocol version used in registration. In theory, that may
1933 increase risk since the application will be able to do less pre-
1934 lookup validation. In practice, the protection afforded by that test
1935 has been largely illusory for reasons explained in RFC 4690 and
1936 above.
1938 Any change to Stringprep or, more broadly, the IETF's model of the
1939 use of internationalized character strings in different protocols,
1940 creates some risk of inadvertent changes to those protocols,
1941 invalidating deployed applications or databases, and so on. Our
1942 current hypothesis is that the same considerations that would require
1943 changing the IDN prefix (see Section 10.3.2) are the ones that would,
1944 e.g., invalidate certificates or hashes that depend on Stringprep,
1945 but those cases require careful consideration and evaluation. More
1946 important, it is not necessary to change Stringprep2003 at all in
1947 order to make the IDNA changes contemplated here. It is far
1948 preferable to create a separate document, or separate profile
1949 components, for IDN work, leaving the question of upgrading to other
1950 protocols to experts on them and eliminating any possible
1951 synchronization dependency between IDNA changes and possible upgrades
1952 to security protocols or conventions.
1954 No mechanism involving names or identifiers alone can protect against
1955 a wide variety of security threats and attacks that are largely
1956 independent of them, including spoofed pages, DNS query trapping and
1957 diversion, and so on.
1959 15. Change Log
1961 [[anchor45: RFC Editor: Please remove this section.]]
1963 For version 00 of draft-ietf-idnabis-rationale, this list contains a
1964 complete trace going back through the earlier, design team, drafts.
1965 Material earlier than that described in Section 15.9 will be removed
1966 in WG draft -02.
1968 15.1. Version -01 of draft-klensin-idnabis-issues
1970 Version -01 of this document is a considerable rewrite from -00.
1971 Many sections have been clarified or extended and several new
1972 sections have been added to reflect discussions in a number of
1973 contexts since -00 was issued.
1975 15.2. Version -02 of draft-klensin-idnabis-issues
1977 o Corrected several editorial errors including an accidentally-
1978 introduced misstatement about NFKC.
1980 o Extensively revised the document to synchronize its terminology
1981 with version 03 of [IDNA2008-Tables] and to provide a better
1982 conceptual framework for its categories and how they are used.
1983 Added new material to clarify terminology and relationships with
1984 other efforts. More subtle changes in this version lay the
1985 groundwork for separating the document into a conceptual overview
1986 and a protocol specification for version 03.
1988 15.3. Version -03 of draft-klensin-idnabis-issues
1990 o Removed protocol materials to a separate document and incorporated
1991 rationale and explanation materials from the original
1992 specification in RFC 3490 into this document. Cleaned up earlier
1993 text to reflect a more mature specification and restructured
1994 several sections and added additional rationale material.
1996 o Strengthened and clarified the A-label / U-label / LDH-label
1997 definition.
1999 o Retitled the document to reflect its evolving role.
2001 15.4. Version -04 of draft-klensin-idnabis-issues
2003 o Moved more text from "protocol" and further reorganized material.
2005 o Provided new material on "Contextual Rule Required".
2007 o Improved consistency of terminology, both internally and with the
2008 "tables" document.
2010 o Improved the IANA Considerations section and discussed the
2011 existing IDNA-related registry.
2013 o More small changes to increase consistency.
2015 15.5. Version -05 of draft-klensin-idnabis-issues
2017 Changed "YES" category back to "ALWAYS" to re-synch with the tables
2018 document and provide clearer terminology.
2020 15.6. Version -06 of draft-klensin-idnabis-issues
2022 o Clarified the prohibitions on strings that look like A-labels but
2023 are not and on unassigned code points.
2025 o Clarified length restrictions on IDN labels.
2027 o Revised the terminology definitions to remove the impression of
2028 circularity and removed invocations of ToASCII and ToUnicode,
2029 which do not exist in IDNA2008.
2031 o Added a new section on front-end processing.
2033 o Added a new section to discuss case-mapping.
2035 o Extended the discussion of prefix changes to identify the
2036 implications of making one.
2038 o Several more editorial improvements, corrected references, and
2039 similar adjustments.
2041 15.7. Version -07 of draft-klensin-idnabis-issues
2043 o Added material that specifically defines the format of contextual
2044 rules.
2046 o Added and altered text after discussions at the 30 January meeting
2047 (see Section 11) and the follow-up to those discussions. Among
2048 the key decisions at that meeting were to eliminate the
2049 distinction among the valid categories (formerly "ALWAYS", "MAYBE
2050 YES", and "MAYBE NO"), to adjust the terminology accordingly, and
2051 to change "CONTEXTUAL RULE REQUIRED" from a separate category in
2052 this document and the protocol one to a modifier of what is now
2053 called "PROTOCOL-VALID". The consequent changes resulted in
2054 removal of several sections of explanation from this document.
2056 o Resynchronized terminology with "protocol" and "tables" documents.
2058 o More editorial and typographic corrections.
2060 15.8. Version -00 of draft-ietf-idnabis-rationale
2062 o Rewrote the abstract and introduction, and retuned the title, to
2063 be more consistent with WG work and activities. Changed the file
2064 name to reflect WG naming.
2066 o Removed most of the material that explained, or compared this
2067 approach to, IDNA2003. Some of this material may appear in the
2068 non-WG "IDNA-alternatives" draft if it is ever completed.
2070 o Changed IDNA200X in terminology and references to IDNA2008.
2072 o Added a contextual rule for hyphen to the appendix, adjusted the
2073 rule syntax slightly, and supplied draft regular expression rules.
2075 o Responded to comments produced during the WG charter discussions
2076 and from several individuals. In general, comments requesting a
2077 reorganization of the collection of documents have not been
2078 responded to pending a WG decision on that topic.
2080 o Moved the contextual rule appendix out of here and into
2081 "Protocol". It may not belong there either, but definitely does
2082 not belong here, and was holding up getting this document out.
2084 o Many small editorial improvements, including reorganization of
2085 some material.
2087 Editorial note: While several sections have been removed from this
2088 version, the WG should discuss whether further cuts are desirable,
2089 e.g., whether Section 7.3, Section 7.4, or Section 10.3 provide
2090 enough value to be worth retaining? Can Section 10.4 be trimmed
2091 without loss of useful information and, if so, how? Section 10.7
2092 appears critical of IDNA2003 in undesirable ways: should it be
2093 dropped or do people have suggestions about how to improve it?
2094 Strong opinions have been expressed that Section 10.5 should be
2095 trimmed significantly or removed entirely. The WG will need to
2096 discuss that too. Are there other materials that should be trimmed
2097 out?
2099 15.9. Version -01 of draft-ietf-idnabis-rationale
2101 o Clarified the U-label definition to note that U-labels must
2102 contain at least one non-ASCII character. Also clarified the
2103 relationship among label types.
2105 o Rewrote the discussion of Labels in Registration (Section 10.1.2)
2106 and related text in Section 1.5.4.1.1 to narrow its focus and
2107 remove more general restrictions. Added a temporary note in line
2108 to explain the situation.
2110 o Changed the "IDNA uses Unicode" statement to focus on
2111 compatibility with IDNA2003 and avoid more general or
2112 controversial assertions.
2114 o Added a discussion of examples to Section 10.1.
2116 o Made a number of other small editorial changes and corrections
2117 suggested by Mark Davis.
2119 o Added several more discussion anchors and notes and expanded or
2120 updated some existing ones.
2122 16. References
2124 16.1. Normative References
2126 [ASCII] American National Standards Institute (formerly United
2127 States of America Standards Institute), "USA Code for
2128 Information Interchange", ANSI X3.4-1968, 1968.
2130 ANSI X3.4-1968 has been replaced by newer versions with
2131 slight modifications, but the 1968 version remains
2132 definitive for the Internet.
2134 [IDNA2008-Bidi]
2135 Alvestrand, H. and C. Karp, "An updated IDNA criterion for
2136 right to left scripts", July 2008.
2139 [IDNA2008-Protocol]
2140 Klensin, J., "Internationalized Domain Names in
2141 Applications (IDNA): Protocol", July 2008.
2145 [IDNA2008-Tables]
2146 Faltstrom, P., "The Unicode Code Points and IDNA",
2147 May 2008.
2150 A version of this document is available in HTML format at
2151 http://stupid.domain.name/idnabis/
2152 draft-ietf-idnabis-tables-01.html
2154 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2155 Requirement Levels", BCP 14, RFC 2119, March 1997.
2157 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
2158 Internationalized Strings ("stringprep")", RFC 3454,
2159 December 2002.
2161 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
2162 "Internationalizing Domain Names in Applications (IDNA)",
2163 RFC 3490, March 2003.
2165 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
2166 Profile for Internationalized Domain Names (IDN)",
2167 RFC 3491, March 2003.
2169 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
2170 for Internationalized Domain Names in Applications
2171 (IDNA)", RFC 3492, March 2003.
2173 [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an
2174 IANA Considerations Section in RFCs", BCP 26, RFC 5226,
2175 May 2008.
2177 [RulesInit]
2178 Klensin, J., "Internationalizing Domain Names in
2179 Applications (IDNA): Protocol, Appendix A Contextual Rules
2180 Table", July 2008.
2183 [Unicode51]
2184 The Unicode Consortium, "The Unicode Standard, Version
2185 5.1.0", 2008.
2187 defined by: The Unicode Standard, Version 5.0, Boston, MA,
2188 Addison-Wesley, 2007, ISBN 0-321-48091-0, as amended by
2189 Unicode 5.1.0
2190 (http://www.unicode.org/versions/Unicode5.1.0/).
2192 16.2. Informative References
2194 [BIG5] Institute for Information Industry of Taiwan, "Computer
2195 Chinese Glyph and Character Code Mapping Table, Technical
2196 Report C-26", 1984.
2198 There are several forms and variations and a closely-
2199 related standard, CNS 11643. See the discussion in
2200 Chapter 3 of Lunde, K., CJKV Information Processing,
2201 O'Reilly & Associates, 1999.
2203 [GB18030] "Chinese National Standard GB 18030-2000: Information
2204 Technology -- Chinese ideograms coded character set for
2205 information interchange -- Extension for the basic set.",
2206 2000.
2208 [RFC0810] Feinler, E., Harrenstien, K., Su, Z., and V. White, "DoD
2209 Internet host table specification", RFC 810, March 1982.
2211 [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
2212 host table specification", RFC 952, October 1985.
2214 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
2215 STD 13, RFC 1034, November 1987.
2217 [RFC1035] Mockapetris, P., "Domain names - implementation and
2218 specification", STD 13, RFC 1035, November 1987.
2220 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application
2221 and Support", STD 3, RFC 1123, October 1989.
2223 [RFC2782] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
2224 specifying the location of services (DNS SRV)", RFC 2782,
2225 February 2000.
2227 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
2228 Engineering Team (JET) Guidelines for Internationalized
2229 Domain Names (IDN) Registration and Administration for
2230 Chinese, Japanese, and Korean", RFC 3743, April 2004.
2232 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
2233 Identifiers (IRIs)", RFC 3987, January 2005.
2235 [RFC4290] Klensin, J., "Suggested Practices for Registration of
2236 Internationalized Domain Names (IDN)", RFC 4290,
2237 December 2005.
2239 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
2240 Recommendations for Internationalized Domain Names
2241 (IDNs)", RFC 4690, September 2006.
2243 [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
2244 "Registration and Administration Recommendations for
2245 Chinese Domain Names", RFC 4713, October 2006.
2247 [Unicode-UTR36]
2248 The Unicode Consortium, "Unicode Technical Report #36:
2249 Unicode Security Considerations", August 2006.
2252 Author's Address
2254 John C Klensin
2255 1770 Massachusetts Ave, Ste 322
2256 Cambridge, MA 02140
2257 USA
2259 Phone: +1 617 245 1457
2260 Email: john+ietf@jck.com
2262 Full Copyright Statement
2264 Copyright (C) The IETF Trust (2008).
2266 This document is subject to the rights, licenses and restrictions
2267 contained in BCP 78, and except as set forth therein, the authors
2268 retain all their rights.
2270 This document and the information contained herein are provided on an
2271 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2272 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
2273 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
2274 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2275 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2276 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
2278 Intellectual Property
2280 The IETF takes no position regarding the validity or scope of any
2281 Intellectual Property Rights or other rights that might be claimed to
2282 pertain to the implementation or use of the technology described in
2283 this document or the extent to which any license under such rights
2284 might or might not be available; nor does it represent that it has
2285 made any independent effort to identify any such rights. Information
2286 on the procedures with respect to rights in RFC documents can be
2287 found in BCP 78 and BCP 79.
2289 Copies of IPR disclosures made to the IETF Secretariat and any
2290 assurances of licenses to be made available, or the result of an
2291 attempt made to obtain a general license or permission for the use of
2292 such proprietary rights by implementers or users of this
2293 specification can be obtained from the IETF on-line IPR repository at
2294 http://www.ietf.org/ipr.
2296 The IETF invites any interested party to bring to its attention any
2297 copyrights, patents or patent applications, or other proprietary
2298 rights that may cover technology that may be required to implement
2299 this standard. Please address the information to the IETF at
2300 ietf-ipr@ietf.org.