idnits 2.17.1
draft-ietf-idn-idna-04.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** Looks like you're using RFC 2026 boilerplate. This must be updated to
follow RFC 3978/3979, as updated by RFC 4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
** Missing expiration date. The document expiration date should appear on
the first and last page.
== No 'Intended status' indicated for this document; assuming Proposed
Standard
== The page length should not exceed 58 lines per page, but there was 1
longer page, the longest (page 1) being 550 lines
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
** The document seems to lack an IANA Considerations section. (See Section
2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
when there are no actions for IANA.)
** There are 3 instances of too long lines in the document, the longest one
being 3 characters in excess of 72.
** The document seems to lack a both a reference to RFC 2119 and the
recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
keywords.
RFC 2119 keyword, line 147: '...label MUST contain only ASCII characte...'
RFC 2119 keyword, line 151: '...2) ACE labels SHOULD be hidden from us...'
RFC 2119 keyword, line 157: '...e compared, they MUST be considered to...'
RFC 2119 keyword, line 249: '...ollowed by two hyphen-minuses. It MUST...'
RFC 2119 keyword, line 307: '...CE label. Applications MAY allow input...'
(13 more instances...)
Miscellaneous warnings:
----------------------------------------------------------------------------
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- Couldn't find a document date in the document -- date freshness check
skipped.
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
-- Missing reference section? 'RFC2119' on line 468 looks like a reference
-- Missing reference section? 'UNICODE' on line 482 looks like a reference
-- Missing reference section? 'STD13' on line 475 looks like a reference
-- Missing reference section? 'NAMEPREP' on line 465 looks like a reference
-- Missing reference section? 'AMC-ACE-Z' on line 462 looks like a reference
-- Missing reference section? 'STD3' on line 471 looks like a reference
-- Missing reference section? 'UAX9' on line 479 looks like a reference
Summary: 5 errors (**), 0 flaws (~~), 2 warnings (==), 9 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
1 Internet Draft Patrik Faltstrom
2 draft-ietf-idn-idna-04.txt Cisco
3 November 8, 2001 Paul Hoffman
4 Expires in six months IMC & VPNC
5 Adam M. Costello
6 UC Berkeley
8 Internationalizing Host Names in Applications (IDNA)
10 Status of this Memo
12 This document is an Internet-Draft and is in full conformance with all
13 provisions of Section 10 of RFC2026.
15 Internet-Drafts are working documents of the Internet Engineering Task
16 Force (IETF), its areas, and its working groups. Note that other groups
17 may also distribute working documents as Internet-Drafts.
19 Internet-Drafts are draft documents valid for a maximum of six months
20 and may be updated, replaced, or obsoleted by other documents at any
21 time. It is inappropriate to use Internet-Drafts as reference material
22 or to cite them other than as "work in progress."
24 The list of current Internet-Drafts can be accessed at
25 http://www.ietf.org/ietf/1id-abstracts.txt
27 The list of Internet-Draft Shadow Directories can be accessed at
28 http://www.ietf.org/shadow.html.
30 Abstract
32 Until now, there has been no standard way for host names to use
33 characters outside the ASCII repertoire. This document describes a
34 mechanism called IDNA that enables internationalized host names,
35 that is, host names that use characters drawn from a much larger
36 repertoire. (The "D" in the name originally stood for "domain",
37 but the work is actually focused on host names, so the word
38 "host" is used throughout this document.)
40 1. Introduction
42 IDNA works by allowing applications to use certain ASCII name labels
43 (beginning with a special prefix) to represent non-ASCII name labels.
44 Lower-layer protocols need not be aware of this; therefore IDNA does not
45 require changes to any infrastructure. In particular, IDNA does not
46 require any changes to DNS servers, resolvers, or protocol elements,
47 because the ASCII name service provided by the existing DNS is entirely
48 sufficient.
50 This document does not require any applications to conform to IDNA,
51 but applications can elect to use IDNA in order to support IDN while
52 maintaining interoperability with existing infrastructure. Adding IDNA
53 support to an existing application entails changes to the application
54 only, and leaves room for flexibility in the user interface.
56 A great deal of the discussion of IDN solutions has focused on
57 transition issues and how IDN will work in a world where not all of the
58 components have been updated. Other proposals would require that user
59 applications, resolvers, and DNS servers be updated in order for a user
60 to use an internationalized host name. Rather than require widespread
61 updating of all components, IDNA requires only user applications to be
62 updated; no changes are needed to the DNS protocol or any DNS servers or
63 the resolvers on users' computers.
65 This document is being discussed on the ietf-idna@mail.apps.ietf.org
66 mailing list. To subscribe, send a message to
67 ietf-idna-request@mail.apps.ietf.org with the single word "subscribe" in
68 the body of the message.
70 2. Terminology
72 [[ Editor's note: the authors are considering changing "host name" to
73 "domain name" throughout the document after discussing this further
74 with the DNS experts. ]]
76 The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
77 "MAY" in this document are to be interpreted as described in RFC 2119
78 [RFC2119].
80 A code point is an integral value associated with a character in a coded
81 character set.
83 Unicode [UNICODE] is a coded character set containing tens of thousands
84 of characters. A single Unicode code point is denoted by "U+" followed
85 by four to six hexadecimal digits, while a range of Unicode code points
86 is denoted by two hexadecimal numbers separated by "..", with no
87 prefixes.
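
The notation above can be illustrated with a small sketch. The function
names format_code_point and format_range are hypothetical and are not
defined by this document.

```python
def format_code_point(code_point: int) -> str:
    """Render one code point as "U+" plus four to six hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Pad to at least four digits; larger values naturally use five or six.
    return f"U+{code_point:04X}"


def format_range(first: int, last: int) -> str:
    """Render a range as two hex numbers separated by "..", no prefixes."""
    return f"{first:X}..{last:X}"
```

For example, the hyphen-minus is written U+002D and the ASCII range is
written 0..7F.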
89 ASCII means US-ASCII, a coded character set containing 128 characters
90 associated with code points in the range 0..7F. Unicode is an extension
91 of ASCII: it includes all the ASCII characters and associates them with
92 the same code points.
94 The term "LDH code points" is defined in this document to mean the code
95 points associated with ASCII letters, digits, and the hyphen-minus; that
96 is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an abbreviation for
97 "letters, digits, hyphen".
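
A minimal sketch of the LDH test, written directly from the ranges
listed above. The function name is_ldh is an illustrative assumption.

```python
def is_ldh(code_point: int) -> bool:
    """Return True if code_point is a letter, digit, or hyphen-minus."""
    return (
        code_point == 0x2D                 # hyphen-minus
        or 0x30 <= code_point <= 0x39      # digits 0-9
        or 0x41 <= code_point <= 0x5A      # letters A-Z
        or 0x61 <= code_point <= 0x7A      # letters a-z
    )
```

Every non-ASCII code point is reported as not LDH, because LDH code
points are by definition a subset of ASCII.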
99 A host label is an individual part of a host name. Host labels are
100 usually shown separated by dots; for example, the host name
101 "www.example.com" is composed of three host labels: "www", "example",
102 and "com". In IDNA, not all text strings can be host labels. A string
103 can be a host label if and only if the ToASCII operation (see section 4)
104 does not fail when applied to it. (The zero-length root label that is
105 implied in host names, as described in [STD13], is not considered a
106 label in this specification.)
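
As an illustration only, splitting a host name into its labels might
look like the sketch below. The name split_labels is hypothetical. The
sketch only separates the labels at dots; it does not check that each
piece is a valid host label, which would require the ToASCII operation.

```python
from typing import List


def split_labels(host_name: str) -> List[str]:
    """Split a dotted host name into its labels."""
    # A single trailing dot stands for the zero-length root label,
    # which is not counted as a label here.
    if host_name.endswith("."):
        host_name = host_name[:-1]
    return host_name.split(".") if host_name else []
```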
108 An "ACE label" is defined in this document to be a host label that
109 contains only ASCII characters but represents a label containing
110 non-ASCII characters (ACE stands for "ASCII-compatible encoding").
111 Internationalized host labels generally contain non-ASCII characters,
112 but for every host label that cannot be directly represented in ASCII
113 there is an equivalent ACE label. The conversion of host labels to and
114 from the ACE form is specified in section 4.
116 The "ACE prefix" is defined in this document to be a string of ASCII
117 characters that appears at the beginning of every ACE label. It is
118 specified in section 5.
120 A "host name slot" is defined in this document to be a protocol element
121 or a function argument or a return value (and so on) explicitly
122 designated for carrying a host name. Examples of host name slots
123 include: the QNAME field of a DNS query; the name argument of the
124 gethostbyname() library function; the part of an email address following
125 the at-sign (@) in the From: field of an email message header; and the host
126 portion of the URI in the src attribute of an HTML tag.
127 General text that just happens to contain a host name is not a host name
128 slot; for example, a host name appearing in the plain text body of an
129 email message is not occupying a host name slot.
131 An "internationalized host name slot" is defined in this document to be
132 a host name slot explicitly designated for carrying an internationalized
133 host name as described in this document. The designation may be static
134 (for example, in the specification of the protocol or interface) or
135 dynamic (for example, as a result of negotiation in an interactive
136 session).
138 A "generic host name slot" is defined in this document to be any host
139 name slot that is not an internationalized host name slot. Obviously,
140 this includes any host name slot whose specification predates IDNA.
142 3. Requirements
144 IDNA conformance means adherence to the following three rules:
146 1) Whenever a host name is put into a generic host name slot, every
147 label MUST contain only ASCII characters. Given any host name, an
148 equivalent host name satisfying this requirement can be obtained by
149 applying the ToASCII operation (see section 4) to each label.
151 2) ACE labels SHOULD be hidden from users whenever possible. Therefore,
152 before a host name is displayed to a user or is output into a context
153 likely to be viewed by users, the ToUnicode operation (see section 4)
154 SHOULD be applied to each label. When requirements 1 and 2 both apply,
155 requirement 1 takes precedence.
157 3) Whenever two host labels are compared, they MUST be considered to
158 match if and only if their ASCII forms (obtained by applying ToASCII)
159 match using a case-insensitive ASCII comparison.
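
Rules 1 and 3 can be sketched as follows. This is not a definitive
implementation: the to_ascii argument is assumed to be a caller-supplied
function that implements the ToASCII operation of section 4 on one
label, and the names host_to_ascii, labels_match, and ascii_fold are
hypothetical.

```python
from typing import Callable


def ascii_fold(text: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave all else unchanged."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in text)


def host_to_ascii(host_name: str, to_ascii: Callable[[str], str]) -> str:
    """Rule 1: apply ToASCII to each label before using a generic slot."""
    return ".".join(to_ascii(label) for label in host_name.split("."))


def labels_match(a: str, b: str, to_ascii: Callable[[str], str]) -> bool:
    """Rule 3: two labels match iff their ASCII forms match, ignoring case."""
    return ascii_fold(to_ascii(a)) == ascii_fold(to_ascii(b))
```

The comparison folds only A through Z so that the result does not
depend on locale-specific case mappings.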
161 4. Conversion operations
163 This section specifies the ToASCII and ToUnicode operations. Each one
164 operates on a sequence of Unicode code points (but remember that all
165 ASCII code points are also Unicode code points). When host names are
166 represented using character sets other than Unicode and ASCII, they will
167 need to first be transcoded to Unicode before these operations can be
168 applied, and might need to be transcoded back afterwards.
170 4.1 ToASCII
172 The ToASCII operation takes a sequence of Unicode code points and
173 transforms it into a sequence of code points in the ASCII range (0..7F).
174 The original sequence and the resulting sequence are equivalent host
175 labels.
177 ToASCII fails if any step of it fails. Failure means that the original
178 sequence cannot be used as a host label.
180 ToASCII never alters a sequence of code points that are all in the ASCII
181 range to begin with (although it may fail).
183 ToASCII consists of the following steps:
185 1. If all code points in the sequence are in the ASCII range (0..7F)
186 then skip to step 3.
188 2. Perform the steps specified in [NAMEPREP].
190 3. Host-specific restrictions:
191 Host names have additional restrictions:
193 * Verify the absence of non-LDH ASCII code points; that is, the
194 absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
196 * Verify the absence of leading and trailing hyphen-minus; that
197 is, the absence of U+002D at the beginning and end of the
198 sequence.
200 4. If all code points in the sequence are in the ASCII range (0..7F),
201 then skip to step 8.
203 5. Verify that the sequence does NOT begin with the ACE prefix.
205 6. Encode the sequence using the encoding algorithm in [AMC-ACE-Z].
207 7. Prepend the ACE prefix.
209 8. Verify that the number of code points is in the range 1 to 63
210 inclusive.
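
The eight steps above can be sketched as shown below, under stated
assumptions. The nameprep and ace_encode arguments are caller-supplied
functions standing in for [NAMEPREP] and the encoder in [AMC-ACE-Z];
neither is implemented here. The prefix "jk--" is only the placeholder
used as an example in section 5, and the ToASCIIError name is
hypothetical.

```python
from typing import Callable, Iterable, List, Sequence

ACE_PREFIX = "jk--"  # placeholder from the example in section 5, not final

CodePoints = List[int]


class ToASCIIError(ValueError):
    """Raised when a step fails, so the input cannot be a host label."""


def _all_ascii(seq: Sequence[int]) -> bool:
    return all(cp <= 0x7F for cp in seq)


def _is_ldh(cp: int) -> bool:
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def _has_ace_prefix(seq: Sequence[int], prefix: str) -> bool:
    # The prefix is recognized in a case-insensitive manner.
    head = list(seq[:len(prefix)])
    return len(head) == len(prefix) and all(
        cp <= 0x7F and chr(cp).lower() == ch.lower()
        for cp, ch in zip(head, prefix))


def to_ascii(code_points: Iterable[int],
             nameprep: Callable[[CodePoints], Iterable[int]],
             ace_encode: Callable[[CodePoints], Iterable[int]],
             ace_prefix: str = ACE_PREFIX) -> CodePoints:
    seq = list(code_points)
    # Steps 1-2: only non-ASCII input goes through nameprep.
    if not _all_ascii(seq):
        seq = list(nameprep(seq))
    # Step 3: host-specific restrictions.
    if any(cp <= 0x7F and not _is_ldh(cp) for cp in seq):
        raise ToASCIIError("non-LDH ASCII code point")
    if seq and (seq[0] == 0x2D or seq[-1] == 0x2D):
        raise ToASCIIError("leading or trailing hyphen-minus")
    # Steps 4-7: encode only if non-ASCII code points remain.
    if not _all_ascii(seq):
        if _has_ace_prefix(seq, ace_prefix):
            raise ToASCIIError("label already begins with the ACE prefix")
        seq = [ord(ch) for ch in ace_prefix] + list(ace_encode(seq))
    # Step 8: length check.
    if not 1 <= len(seq) <= 63:
        raise ToASCIIError("length outside the range 1 to 63")
    return seq
```

An all-ASCII input is either returned unchanged or rejected, which
matches the statement that ToASCII never alters such a sequence.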
212 4.2 ToUnicode
214 The ToUnicode operation takes a sequence of Unicode code points and
215 returns a sequence of Unicode code points. If the input sequence is a
216 host label in ACE form, then the result is an equivalent host label
217 that is not in ACE form, otherwise the original sequence is returned
218 unaltered.
220 ToUnicode never fails. If any step fails, then the original input
221 sequence is returned immediately in that step.
223 1. If all code points in the sequence are in the ASCII range (0..7F)
224 then skip to step 3.
226 2. Perform the steps specified in [NAMEPREP]. (If step 3
227 of ToASCII is also performed here, it will not affect the
228 overall behavior of ToUnicode, but it is not necessary.)
230 3. Verify that the sequence begins with the ACE prefix, and save a
231 copy of the sequence.
233 4. Remove the ACE prefix.
235 5. Decode the sequence using the decoding algorithm in [AMC-ACE-Z]. Save
236 a copy of the result of this step.
238 6. Apply ToASCII.
240 7. Verify that the sequence matches the saved copy from step 3, using
241 a case-insensitive ASCII comparison.
243 8. Return the saved copy from step 5.
245 5. ACE prefix
247 The ACE prefix, used in the conversion operations (section 4), will
248 be specified in a future revision of this document. It will be two
249 alphanumeric ASCII characters followed by two hyphen-minuses. It MUST
250 be recognized in a case-insensitive manner.
252 For example, the eventual ACE prefix might be the string "jk--". In this
253 case, an ACE label might be "jk--r3c2a-qc902xs", where "r3c2a-qc902xs"
254 is the part of the ACE label that is generated by the encoding steps in
255 [AMC-ACE-Z].
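
Recognizing the prefix in a case-insensitive manner might be sketched
as follows. The prefix "jk--" is only the example value used above, and
the function name strip_ace_prefix is hypothetical.

```python
from typing import Optional

ACE_PREFIX = "jk--"  # example value only; the real prefix is not yet chosen


def strip_ace_prefix(label: str, prefix: str = ACE_PREFIX) -> Optional[str]:
    """Return the part after the ACE prefix, or None if it is absent."""
    head = label[:len(prefix)]
    # Require an all-ASCII head so that lower() is a pure ASCII case fold.
    if len(head) == len(prefix) and head.isascii() and head.lower() == prefix.lower():
        return label[len(prefix):]
    return None
```

With the example label above, strip_ace_prefix("jk--r3c2a-qc902xs")
returns "r3c2a-qc902xs", and the same result is obtained for
"JK--r3c2a-qc902xs".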
257 6. Implications for typical applications using DNS
259 In IDNA, applications perform the processing needed to input
260 internationalized host names from users, display internationalized
261 host names to users, and process the inputs and outputs from DNS and
262 other protocols that carry host names.
264 The components and interfaces between them can be represented
265 pictorially as:
267 +------+
268 | User |
269 +------+
270 ^
271 | Input and display: local interface methods
272 | (pen, keyboard, glowing phosphorus, ...)
273 +-------------------|-------------------------------+
274 | v |
275 | +-----------------------------+ |
276 | | Application | |
277 | | (conversion between local | |
278 | | character set and Unicode | |
279 | | is done here) | |
280 | +-----------------------------+ |
281 | ^ ^ | End system
282 | | | |
283 | Call to resolver: | | Application-specific |
284 | ACE | | protocol: |
285 | v | predefined by the |
286 | +----------+ | protocol or defaults |
287 | | Resolver | | to ACE |
288 | +----------+ | |
289 | ^ | |
290 +-----------------|----------|----------------------+
291 DNS protocol: | |
292 ACE | |
293 v v
294 +-------------+ +---------------------+
295 | DNS servers | | Application servers |
296 +-------------+ +---------------------+
298 6.1 Entry and display in applications
300 Applications can accept host names using any character set or sets
301 desired by the application developer, and can display host names in any
302 charset. That is, the IDNA protocol does not affect the interface
303 between users and applications.
305 An IDNA-aware application can accept and display internationalized host
306 names in two formats: the internationalized character set(s) supported
307 by the application, and as an ACE label. Applications MAY allow input
308 and display of ACE labels, but are not encouraged to do so except as an
309 interface for special purposes, possibly for debugging. ACE encoding is
310 opaque and ugly, and should thus only be exposed to users who absolutely
311 need it. The optional use, especially during a transition period, of ACE
312 encodings in the user interface is described in section 6.4. Because
313 name labels encoded as ACE name labels can be rendered either as the
314 encoded ASCII characters or the proper decoded characters, the
315 application MAY have an option for the user to select the preferred
316 method of display; if it does, rendering the ACE SHOULD NOT be the
317 default.
319 Host names are often stored and transported in many places. For example,
320 they are part of documents such as mail messages and web pages. They are
321 transported in the many parts of many protocols, such as both the
322 control commands and the RFC 2822 body parts of SMTP, and the headers
323 and the body content in HTTP. It is important to remember that host
324 names appear both in host name slots and in the content that is passed
325 over protocols.
327 In protocols and document formats that define how to handle
328 specification or negotiation of charsets, IDN host name labels can be
329 encoded in any charset allowed by the protocol or document format. If a
330 protocol or document format only allows one charset, IDN host name
331 labels MUST be given in that charset. In any place where a protocol or
332 document format allows transmission of the characters in IDN host name
333 labels, IDN host name labels SHOULD be transmitted using whatever
334 character encoding and escape mechanism that the protocol or document
335 format uses at that place.
337 All protocols that have generic host name slots already have the
338 capacity for handling host names in the ASCII charset. Thus, IDN host
339 name labels that have been processed with the ToASCII operation can
340 inherently be handled by those protocols.
342 6.2 Applications and resolvers
344 Applications communicate with resolver libraries through a programming
345 interface (API). Typically, the IETF does not standardize APIs, although
346 there are non-standard APIs specified for IPv6. This protocol does not
347 specify a specific API, but instead specifies the operations that must
348 be used for input to and output from the resolver library.
350 An application MUST prepare name parts that are sent in the DNS
351 protocol using the ToASCII operation. Internationalized labels received
352 from the resolver will always be in ACE form. IDNA-aware applications
353 MUST be able to work with both non-internationalized host name labels
354 (those that conform to [STD13] and [STD3]) and internationalized host
355 name labels.
357 6.3 Resolvers and DNS servers
359 An operating system might have a set of libraries for performing the
360 ToASCII operation. The input to such a library might be in one or more
361 charsets that are used in applications (UTF-8 and UTF-16 are likely
362 candidates for almost any operating system, and script-specific charsets
363 are likely for localized operating systems).
365 DNS servers MUST use the ACE format for internationalized host labels.
366 All internationalized names stored in DNS servers must be valid names
367 that have been processed with the ToASCII operation.
369 If a signalling system which makes negotiation possible between old and
370 new DNS clients and servers is standardized in the future, the encoding
371 of the query in the DNS protocol itself can be changed from ACE to
372 something else, such as UTF-8. The question whether or not this should
373 be used is, however, a separate problem and is not discussed in this
374 memo.
376 6.4 Avoiding exposing users to the raw ACE encoding
378 All applications that might show the user a host name that was received
379 from a gethostbyaddr or other such lookup SHOULD be updated as soon as
380 possible in order to prevent users from seeing the ACE. However, this is
381 not considered a big problem because so few applications show this type
382 of resolution to users.
384 If an application decodes an ACE name using ToUnicode but cannot show
385 all of the characters in the decoded name, such as if the name contains
386 characters that the output system cannot display, the application SHOULD
387 show the name in ACE format instead of displaying the name with the
388 replacement character (U+FFFD). This is to make it easier for the user
389 to transfer the name correctly to other programs. Programs that by
390 default show the ACE form when they cannot show all the characters in a
391 name label SHOULD also have a mechanism to show the name that is
392 produced by the ToUnicode operation with as many characters as possible
393 and replacement characters in the positions where characters cannot be
394 displayed. In many cases, the application doesn't know exactly what the
395 underlying rendering engine can or cannot display.
397 In addition to the condition above, if a host name is still in ACE
398 form after the application performs the ToUnicode operation, meaning that the
399 name was not properly prepared with ToASCII (for example, if it has
400 illegal characters in it), the application MUST show the name in ACE
401 format because the ToUnicode operation never fails, but returns the
402 original input if errors are detected at any step.
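
The display rules in this section can be sketched as below. The
to_unicode argument is assumed to be a function implementing the
ToUnicode operation on one label, and can_display is a hypothetical
predicate reporting whether the output system can render a character.
As noted above, an application often cannot know this exactly, so the
predicate is an approximation at best.

```python
from typing import Callable


def choose_display_form(ace_label: str,
                        to_unicode: Callable[[str], str],
                        can_display: Callable[[str], bool]) -> str:
    """Pick the form of a label to show to the user by default."""
    decoded = to_unicode(ace_label)
    # ToUnicode returns its input unchanged when any step fails, so an
    # unchanged result means the label must be shown in ACE form.
    if decoded == ace_label:
        return ace_label
    # Show the decoded form only if every character can be displayed;
    # otherwise fall back to ACE rather than use U+FFFD.
    if all(can_display(ch) for ch in decoded):
        return decoded
    return ace_label
```

An application following this sketch would still offer a separate
mechanism for showing the decoded name with replacement characters.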
404 6.5 Bidirectional text in host names
406 The display of host names that contain bidirectional text is not covered
407 in this document. It may be covered in a future version of this
408 document, or may be covered in a different document.
410 For developers interested in displaying host names that have
411 bidirectional text, the Unicode standard has an extensive discussion of
412 how to reorder glyphs for display when dealing with
413 bidirectional text such as Arabic or Hebrew. See [UAX9] for more
414 information. In particular, all Unicode text is stored in logical order.
416 7. Name Server Considerations
418 Internationalized host name data in zone files (as specified by section
419 5 of RFC 1035) MUST be processed with ToASCII before it is entered in
420 the zone files.
422 It is imperative that there be only one ASCII encoding for a particular
423 host name. ACE is an encoding for host name labels that use non-ASCII
424 characters. Thus, a primary master name server MUST NOT contain an
425 ACE-encoded label that decodes to an ASCII label. The ToASCII operation
426 assures that no such names are ever output from the operation.
428 Name servers MUST NOT have any records with host names that contain
429 internationalized name labels unless those name labels have been prepared
430 with the ToASCII operation. If names that are not processed by ToASCII
431 are passed to an application, it will result in unpredictable behavior.
432 Note that [NAMEPREP] describes how to handle versioning of unallocated
433 codepoints.
435 8. Root Server Considerations
437 Because there are no changes to the DNS protocols, adopting this
438 protocol has no effect on the DNS root servers.
440 9. Security Considerations
442 Much of the security of the Internet relies on the DNS. Thus, any change
443 to the characteristics of the DNS can change the security of much of the
444 Internet.
446 This memo describes an algorithm which encodes characters that are not
447 valid according to STD3 and STD13 into octet values that are valid. No
448 security issues such as string length increases or new allowed values
449 are introduced by the encoding process or the use of these encoded
450 values, apart from those introduced by the ACE encoding itself.
452 Host names are used by users to connect to Internet servers. The
453 security of the Internet would be compromised if a user entering a
454 single internationalized name could be connected to different servers
455 based on different interpretations of the internationalized host name.
457 Because this document normatively refers to [NAMEPREP], it includes the
458 security considerations from that document as well.
460 A. References
462 [AMC-ACE-Z] Adam Costello, "AMC-ACE-Z version 0.3.1",
463 draft-ietf-idn-amc-ace-z.
465 [NAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
466 Internationalized Host Names", draft-ietf-idn-nameprep.
468 [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
469 Requirement Levels", March 1997, RFC 2119.
471 [STD3] Bob Braden, "Requirements for Internet Hosts -- Communication
472 Layers" (RFC 1122) and "Requirements for Internet Hosts -- Application
473 and Support" (RFC 1123), STD 3, October 1989.
475 [STD13] Paul Mockapetris, "Domain names - concepts and facilities" (RFC
476 1034) and "Domain names - implementation and specification" (RFC 1035),
477 STD 13, November 1987.
479 [UAX9] Unicode Standard Annex #9, The Bidirectional Algorithm.
480 http://www.unicode.org/unicode/reports/tr9/
482 [UNICODE] The Unicode Standard, Version 3.1.0: The Unicode Consortium.
483 The Unicode Standard, Version 3.0. Reading, MA, Addison-Wesley
484 Developers Press, 2000. ISBN 0-201-61633-5, as amended by: Unicode
485 Standard Annex #27: Unicode 3.1.
488 B. Design philosophy
490 Many proposals for IDN protocols have required that DNS servers be
491 updated to handle internationalized host names. Because of this, a
492 person who wanted to use an internationalized host name would have to be
493 sure that their request went to a DNS server that had been updated for
494 IDN. Further, that server could send queries only to other servers that
495 had been updated for IDN, because the queries contain new protocol
496 elements to differentiate IDN name labels from current host labels. In
497 addition, these proposals require that resolvers be updated to use the
498 new protocols, and in most cases the applications would need to be
499 updated as well.
501 These proposals would require changes to the application protocols that
502 use host names as protocol elements, because of the assumptions and
503 requirements made in those protocols about the characters that have
504 always been used for host names, and the encoding of those characters.
505 Other proposals for IDN protocols do not require changes to DNS servers
506 but still require changes to most application protocols to handle the
507 new names.
509 Updating all (or even a significant percentage) of the existing servers
510 in the world will be difficult, to say the least. Updating applications,
511 application gateways, and clients to handle changes to the application
512 protocols is also daunting. Because of this, we have designed a protocol
513 that requires no updating of any name servers. IDNA still requires the
514 updating of applications, but only for input and display of names, not
515 for changes to the protocols. Once users have updated the applications,
516 they can immediately start using internationalized host names. The cost
517 of implementing IDN may thus be much lower, and the speed of
518 implementation could be much higher.
520 C. Authors' Addresses
522 Patrik Faltstrom
523 Cisco Systems
524 Arstaangsvagen 31 J
525 S-117 43 Stockholm Sweden
526 paf@cisco.com
528 Paul Hoffman
529 Internet Mail Consortium and VPN Consortium
530 127 Segre Place
531 Santa Cruz, CA 95060 USA
532 phoffman@imc.org
534 Adam M. Costello
535 University of California, Berkeley
536 idna-spec.amc @ nicemice.net