INTERNET-DRAFT                                      Eric A. Hall, Editor
Document: draft-hall-dm-idns-00.txt                           Consultant
Expires: May 2002                                          November 2001

               The Internationalized Domain Name System

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

The principal intention of this specification is to facilitate the deployment of a completely internationalized domain name syntax and service which new protocols, applications and host systems can use, but without disrupting the existing infrastructure. Towards that end, this document describes a series of elective encapsulation services and protocol extensions which cumulatively allow internationalized domain names to be stored and transmitted in the existing DNS message and within application data streams, according to the compliance level of the participating systems.

Table of Contents

1. Abstract
2. Definitions and Terminology
3. Introduction
   3.1. Background
   3.2. Objectives
   3.3. Common Usage Scenarios
   3.4. User Audiences
   3.5. Service Overview
   3.6. Process Example
4. The Internationalized Namespace
   4.1. Internationalized Domain Names and Labels
   4.2. Internationalized Host Identifiers
   4.3. STD13 Domain Names
   4.4. STD13 Host Identifiers
5. Transfer Encodings and Label Types
   5.1. The EDNS/UTF-8 Label Type
   5.2. The STD13 Legacy Label Type
6. Application Guidelines
   6.1. Input and Output Charsets
   6.2. Protocol and Application Data
   6.3. DNS Lookups and Resolver Calls
7. Resolver Guidelines
   7.1. Resolver APIs
   7.2. Query Processing Services
   7.3. The Hosts Database
8. Server Guidelines
   8.1. Internationalized Zones
   8.2. Namespace Visibility Restrictions
   8.3. The Master File Format
9. Caching Guidelines
10. Security Considerations
11. IANA Considerations
12. References
13. Acknowledgements
14. Editor's Address

2. Definitions and Terminology

This document unites, enhances and clarifies several pre-existing technologies. Readers are expected to be familiar with the following specifications:

   [AMC-ACE-Z] , "AMC-ACE-Z version 0.3.1"

   [NAMEPREP] , "Preparation of Internationalized Host Names"

   [STD13] (RFC 1034) "Domain names - concepts and facilities", (RFC 1035) "Domain names - implementation and specification"

   [STD3] (RFC 1122) "Requirements for Internet Hosts -- Communication Layers", (RFC 1123) "Requirements for Internet Hosts -- Application and Support"

   [BCP18] (RFC 2277) "IETF Policy on Character Sets and Languages"

   [RFC2279] "UTF-8, a transformation format of ISO 10646"

   [RFC2671] "Extension Mechanisms for DNS (EDNS0)"

The following abbreviations are used throughout this document:

   UCS (Universal Character Set) - The ISO/IEC 10646 character set repertoire, as represented by the Unicode 3.1 specification.

   ACE (ASCII-Compatible Encoding) - A transfer encoding which encodes UCS character codes into a seven-bit codespace which is compatible with US-ASCII.

   UTF-8 (UCS Transformation Format, Eight-Bit) - A transfer encoding which encodes UCS characters into an eight-bit codespace which is compatible with DNS message formats.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

3. Introduction

The domain name system (DNS) [STD13] currently defines a message, namespace and protocol. Although the DNS message is capable of transferring eight-bit character codes as protocol data, applications are currently limited to a subset of US-ASCII when they interact with the DNS namespace, and this restricted syntax is enforced by almost every TCP/IP application and protocol which utilizes domain names as embedded data (including, surprisingly, the DNS protocol).
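
As an informal illustration of the restricted syntax mentioned above, the following Python sketch checks a name against the traditional host name rules of RFC 952 and RFC 1123 (letters, digits and the hyphen from US-ASCII, with the STD13 length limits). The function name is_legacy_host_name and the constant LDH_CHARS are hypothetical and are not defined by this document; the sketch is only meant to show the kind of check that legacy applications typically apply.

```python
import string

# Letters, digits and hyphen from US-ASCII (illustrative constant, not from this document).
LDH_CHARS = set(string.ascii_letters + string.digits + "-")


def is_legacy_host_name(name):
    """Return True if `name` fits the traditional letters-digits-hyphen host name rules."""
    # A single trailing dot (the root label) is permitted in the textual form.
    if name.endswith("."):
        name = name[:-1]
    # 253 characters of text corresponds to the 255-octet STD13 limit on the wire.
    if not name or len(name) > 253:
        return False
    for label in name.split("."):
        # Each label is limited to 1..63 octets by STD13.
        if not 1 <= len(label) <= 63:
            return False
        # Only US-ASCII letters, digits and hyphen are allowed.
        if any(ch not in LDH_CHARS for ch in label):
            return False
        # A label may not begin or end with a hyphen.
        if label[0] == "-" or label[-1] == "-":
            return False
    return True


print(is_legacy_host_name("host.example.com"))        # legacy-compatible name
print(is_legacy_host_name("h\u00f6st.example.com"))   # contains a non-ASCII character
```

A name which fails this kind of check is the case that the rest of this document is concerned with.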
In order to allow for the use of a larger range of characters in the namespace, this document extends and clarifies a variety of Internet specifications so that characters from the Universal Character Set (UCS) [ISO10646] may be used in domain names. This document also extends the DNS message structure to allow for the use of UTF-8 [RFC2279] encoded characters for the purpose of transferring these domain names, but also provides an ASCII-compatible encoding (ACE) [AMC-ACE-Z] of these character codes which existing protocols and applications can use to access the internationalized domain names, and also provides identification mechanisms which allow the end-point systems to downwardly negotiate when needed. Finally, this document defines behavior for DNS systems which implement this architecture, including the end-point applications which generate and store DNS domain names, and the resolvers, caches and servers which process them.

The mechanisms presented here are elective. Developers, zone administrators and network operators who wish to make use of the internationalized domain names may do so according to their own schedule. Those developers, administrators and operators who cannot or prefer not to implement the specified extensions can continue to use their legacy systems, and will still be able to access resources from the internationalized domain name system.

3.1. Background

From one perspective, DNS is already an "eight-bit clean" system, in that the structured DNS message is capable of storing and transmitting eight-bit data without any additional effort. However, this perspective only considers one particular facet of the domain name system, and ignores the more critical aspect of the DNS namespace, which has rules that are entirely different from those which govern the message format.
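
The "eight-bit clean" property of the message can be seen in a minimal Python sketch of the traditional STD13 label layout, in which each label is carried as a length octet followed by that many data octets, and the name ends with a zero-length root label. The helper name to_wire is hypothetical. The sketch shows the traditional layout only; it does not attempt to show the EDNS/UTF-8 label type defined later in this document, and it says nothing about whether the namespace rules permit the octets being carried.

```python
def to_wire(labels):
    """Encode an iterable of byte-string labels in the STD13 (RFC 1035) wire layout."""
    out = bytearray()
    for label in labels:
        # STD13 limits each label to 1..63 octets.
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1..63 octets")
        out.append(len(label))  # length octet
        out += label            # label data; any octet value fits here
    out.append(0)               # zero-length root label terminates the name
    # STD13 limits the complete encoded name to 255 octets.
    if len(out) > 255:
        raise ValueError("encoded name exceeds 255 octets")
    return bytes(out)


# A legacy name and a name whose first label holds UTF-8 octets above 0x7F.
ascii_name = to_wire([b"host", b"example", b"com"])
eight_bit_name = to_wire(["h\u00f6st".encode("utf-8"), b"example", b"com"])

print(ascii_name.hex())
print(eight_bit_name.hex())
```

Both names encode without error, which is the sense in which the message format is eight-bit clean. Whether the second name is meaningful depends on the namespace rules rather than on the message format.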
The DNS namespace (or more appropriately, the view of the namespace which applications use and enforce) is governed by rules set forth in RFC952 [RFC952], STD3 [STD3], and STD13, which collectively define the characters that are eligible for use with host names. These rules are meant to provide a common template which may be applied to either the DNS namespace or a local hosts database, such that a query for "host.example.com" can be processed through either system.

The valid characters currently defined are the letters, digits and hyphen from US-ASCII [ASCII] (additional rules also govern the valid order and length of a host name). Character code values outside of this range are valid in domain name messages, but are undefined when used in the namespace, and are subject to interpretation by the applications which generate them.

The host name rules are enforced by almost every application and protocol which uses DNS to identify a host or system. This includes network utilities such as ping and traceroute which simply identify systems by name, and complex protocols such as SMTP which use domain names to determine message-routing paths. Portions of the DNS protocol itself are also affected by these restrictions, such as the domain names which may be used for NS resource records with sub-domain delegation operations (since these servers are connection targets, they are also required to be compliant with the host name rules).

Because these domain names are so pervasive throughout the Internet (and even within proprietary applications that run on private networks), it is not possible to declare a "flag day" at which eight-bit domain names will be considered valid encodings of a particular character set.
Instead, an extended namespace with a larger set of charset rules must be defined, an extended DNS protocol capable of supporting these domain names must be deployed, and a transitional mechanism which allows the old and new systems to interact must be established. This document attempts to meet these objectives.

3.2. Objectives

In broad terms, this document has one overall goal, which is to facilitate the creation and use of an internationalized domain name system around a UCS namespace, a collection of UTF-8 and legacy-compatible encodings which are suitable for transferring internationalized domain names within DNS and the affected application data streams, and a negotiation mechanism which allows end-point systems to identify the encoding that they will use for a particular operation.

One of the objectives stated above is to internationalize the existing DNS namespace, by allowing UCS characters to be used in host names and sub-domain delegations in old and new zones equally. As such, this document does not define a new namespace, but instead defines mechanisms by which leaf-nodes and sub-domains may be created within the existing hierarchy.

UTF-8 was chosen as the primary transfer encoding of these domain names for several reasons. For one, there is a wide availability of tools and expertise surrounding UTF-8, and it is already widely deployed within development environments, operating systems and applications. Furthermore, BCP18 [BCP18] requires that new application protocols be able to use UTF-8 as application data, and for many applications, this specifically means domain names which are passed as data. All signs indicate that UTF-8 is currently and will continue to be the preferred eight-bit encoding on the Internet, and this specification embraces this position in its design.
However, most of the network services currently in use are bound by the legacy host naming restrictions, and those applications and protocols will also need to be able to interact with resources from the internationalized namespace, even though they will not be compliant with the UTF-8 encoding mechanisms defined in this document. In order to allow these systems to participate, this specification also embraces the use of ACE as a seven-bit backwards-compatible encoding for legacy systems to use.

Note that even though a single encoding could have been specified by this document, past and present requirements would not have been satisfied by a single choice. For example, supporting UTF-8 alone would mean isolating legacy systems from resources in the UCS namespace, while supporting ACE alone would not have provided a truly internationalized namespace (the ACE encoded domain names still appear in user data quite frequently). By allowing the UTF-8 and ACE encodings to coexist, the existing and emerging communities can both be served.

Because both encodings will be active during the same time period, this document also defines DNS protocol extensions which allow the end-point systems to detect the encoding that is in use for a particular query/response pair. Note that these negotiation mechanisms not only allow new and legacy systems to interoperate, but they also provide a transition service for developers, zone administrators and end-users, in that ACE encoded domain names can be initially deployed within existing applications and DNS systems, while individual elements of the infrastructure can be upgraded without disturbing other components.

3.3. Common Usage Scenarios

Discussion of the mechanism provided by this document depends upon the usage context of the domain names themselves.
Domain names are extremely pervasive, and are used by almost every TCP/IP protocol and application in one form or another. However, most usages fall under one or more of the following scenarios:

* Connection identifiers. Domain names are most commonly used as host-specific identifiers for outbound connection requests, whether this be for a command-line application such as ping, or as a host name which is stored in an application's configuration file. Another common usage scenario for connection identifiers is with reverse lookups, where a server is logging incoming connections by the corresponding domain name, or where a program such as netstat is displaying all of the application sessions which are currently active on a host. In both of these cases, domain names are passed through applications to a resolver, resulting in DNS queries and responses which eventually provide the requested DNS data.

  A related use (but one which does not generate DNS messages) is determining the host name of the local system. This is commonly found with applications and protocols that need to display the domain name of the local system as part of a protocol operation (such as an SMTP greeting banner) or as application data.

  Connection identifiers (and lookups in general) are probably the largest single use of domain names today, and this is likely to be the case with internationalized domain names as well. This document fully supports the use of internationalized domain names for lookup operations, as long as the calling application, the stub resolver, the local caching servers, and the authoritative servers for the specified domain name are compliant with this specification. If any of these components are not capable of supporting internationalized domain names in this manner, the ACE equivalent domain name will be negotiated for the operation at hand.

* Protocol data.
  Some application protocols exchange domain names as protocol data, with those domain names either determining or altering a service-specific operation. Examples of this usage include SMTP envelopes ("RCPT TO ") where the domain name is used to determine whether or not a particular email message should be accepted for delivery, the HTTP HOST header field which identifies a specific document tree on a shared server, BOOTP/DHCP options, WHOIS input, and more.

  Because these protocols treat domain names as protocol data, most of these protocols also have specific formatting requirements which must be addressed before UTF-8 domain names can be used by these protocols directly. This document is intended to facilitate the use of UTF-8 encoded domain names in this manner, although it is expected that most of the protocol development groups will need to develop negotiation mechanisms before these protocols can use internationalized domain names directly. Until such work is completed, ACE equivalent domain names can be used to provide these protocols with access to the internationalized namespace.

* Structured application data. Structured application data is similar to protocol data in that it can trigger or affect some protocol action, although this will not always occur. For example, a web browser can process an embedded IMG link which may be present in a web page, while a user can manually follow an embedded email link which is also stored in the same web page; even though both usage models share the same structured data format (URLs), they are processed differently by the application. Similarly, email messages typically contain multiple domain names as structured data in the message headers, and some of these domain names will directly affect subsequent protocol operations, while others will not.

  Because of this ambiguity, this document defines no specific treatment for structured application data.
  In some cases, no additional mechanisms will be required, while other scenarios will require negotiation mechanisms before an internationalized domain name can be used in the structured data (with ACE being required as the interim format). Each protocol development group is encouraged to analyze each usage independently, to classify the usage as a connection identifier, protocol data, or unstructured application data, and to determine the appropriate course of action for each usage accordingly.

* Unstructured application data. Many application protocols provide free-text data which can contain domain names, but with those domain names existing as unstructured data. For example, an email message which is provided as a text/plain MIME body part may contain a domain name which identifies a system or service in the context of a specific application, but in an unstructured form ("your files were moved from server1 to server2"). Similarly, an email address may be provided in WHOIS output, but as unstructured data which does not affect the protocol.

  Given the application-specific nature of this data, it cannot be managed by any global protocol or process. Where a protocol has rules or restrictions on the data itself, then those rules are maintained, but some formatting rules may need to be extended before internationalized domain names (or their equivalents) can be encoded in the application data. For example, internationalized domain names in email messages may need to be converted to a preferred display charset, while ACE equivalents may be necessary for protocols which only support US-ASCII.

Each of the above scenarios represents a distinct handling case where internationalized domain names may or may not be used directly.
In some cases, the internationalized domain names may be used as soon as the applications and resolvers are configured to use them, while in other cases, measured and cautious deployment is required in order to prevent undue breakage. In the latter cases, however, the backwards-compatible ACE encoding is available so that the internationalized domain names can be used.

3.4. User Audiences

Another perspective on the changes which will result from deploying the mechanisms described in this document can be seen by analyzing how any such changes will affect the different "audiences" who work with domain names, and who have their own unique context-specific usage requirements and objectives.

The three main audiences discussed in this document are:

* Developers. Protocol and application developers need to be able to incorporate internationalized domain names into their systems as easily as possible, although there are many factors which will affect such usage, including the input and output charsets and encodings which are available to the applications and protocols. Where feasible, this specification allows developers to choose any charset or encoding which may be required and suitable for use, although in most cases, a recommendation is also made for the use of UTF-8 in particular.

  Developers may adopt internationalized domain names for connection identifiers and lookup operations fairly quickly, such that users can use those systems as soon as they have compliant systems (and they have a target domain name to communicate with). Implementing support for internationalized domain names in protocols and application data will require additional effort by the affected development groups.

  Support for ACE will be harder to implement, since it is a relatively new and untested encoding syntax, with no existing developer tools.
  This will likely be the largest hurdle to overcome when developing applications for use with this service.

* Zone administrators. Organizations that wish to deploy internationalized domain names should be able to do so easily, at a reasonable cost, and without suffering excessive pre-conditions. Towards this objective, the mechanisms described by this document allow organizations to deploy and use internationalized domain names within any zone immediately, without requiring any other zone to have been updated beforehand (although there are specific and strong suggestions for upgrading the Internet's high-load servers as soon as possible).

  If an organization wishes to publish internationalized domain names for users to access and utilize, the authoritative servers for the affected zone must be compliant with the naming rules and message formats described by this document, which will almost certainly require the administrators of that zone to upgrade their servers. However, organizations may also choose to only deploy ACE encoded domain names if an immediate migration is not feasible, with the caveat that internationalized domain names in their native form will not be available from those zones.

* Network operators. The systems and human users which generate DNS lookups are another area of concern, as these protocols, programs and users will expect these lookups to succeed, and will also expect that the visible namespace will be compatible with the capabilities of the requesting system at a minimum investment.

  This is a broad range of requirements. At a minimum, applications must be capable of generating and accepting the internationalized domain names if they are to use those domain names (see the "Developers" discussion above for the application requirements).
  Similarly, the local resolvers, caches and forwarders on the user's network must also support the message formats if they are to relay internationalized domain names between their local applications and the remote zones being queried. If the applications, resolvers and caches do not support these requirements, intermediary systems will perform the down-level negotiation automatically on their behalf such that additional effort is not required on the user's part.

In summary, the developers, zone administrators and end-users can immediately participate in the internationalized namespace at no additional expense if they are content with using ACE encoded domain names, and can use internationalized domain names in their native form if they are willing to make the necessary investments. Furthermore, since the native and backwards-compatible encodings are not mutually exclusive, implementers of this specification have the option of adopting ACE for immediate use and then transitioning to internationalized domain names on a per-system, per-zone, or per-application basis, according to their schedule.

3.5. Service Overview

This document specifies a variety of extensions to several different protocols and services in order to facilitate the use of internationalized domain names anywhere this support exists or can be implemented, and to provide a legacy-compatible domain name in all other situations.

More specifically, this document defines or clarifies behavior for the following elements:

* Host name character restrictions. Legacy protocols and applications are currently restricted to the legacy host naming rules, which only allow for a subset of US-ASCII characters (letters, digits and the hyphen character).
  This document redefines the characters which are valid within a host name so that system identifiers, domain name parts of host names, and new network services can use most of the characters from the UCS.

* DNS message format. This document defines an extended label format based on the extended label services provided by RFC2671 (Extension Mechanisms for DNS - EDNS0) [RFC2671], with this label format being used to encapsulate UTF-8 encoded internationalized domain names in DNS messages. Any DNS message which carries the UTF-8 encoded domain names is required to use the EDNS/UTF-8 label type defined in this document. Any DNS message which carries legacy domain names (including the ACE encoded equivalent domain names) is required to use the traditional message format.

* Application handling rules. Applications can use internationalized domain names immediately for lookup operations that do not directly affect external services or protocols, and can use ACE encoding sequences to specify internationalized domain names in legacy protocol operations, and can use them both at the same time.

* Stub resolvers. Stub resolvers will most likely need to provide a series of internationalized APIs in order to fully support applications that generate internationalized domain name lookups. For example, these APIs will almost certainly be required in order for the resolver to determine that the calling application is compliant with the host name requirements defined by this document, and that the domain names should be encoded in the proper label format. Although this specification does not dictate these APIs, it encourages their use, and provides some guidance on the issues surrounding their use.

* Forwarders, resolving servers and caches.
  The user-side servers which process internationalized domain names have several protocol-specific requirements, including the negotiated fall-back service when UTF-8 queries fail.

* Authoritative servers. A key part of this specification is the simultaneous support for internationalized and legacy compatible domain names in the UCS namespace, thereby allowing a domain name to be entered into an authoritative zone database once, and for the appropriate response to be generated by a server according to the label encoding from the associated query. In order for this to work, this specification requires authoritative servers which serve internationalized domain names to comply with specific conditions. This specification also allows existing servers to serve ACE equivalent domain names when the authoritative servers cannot be upgraded, although this typically results in lower levels of functionality.

The elements listed above collectively define a completely internationalized domain name system, which is capable of servicing internationalized domain names in all compliant systems, and which is also capable of providing ACE encoded equivalent domain names when any component from the internationalized service is not available.

3.6. Process Example

This section illustrates a series of query/response transactions under which the processes and protocols defined in this document function. This example uses a reverse lookup for the PTR resource record associated with the "14.2.0.192.in-addr.arpa." domain name (forward lookups work similarly, but the issues are more fully demonstrated by PTR lookups).

Each of the various technologies shown below is described in later sections of this document. The sole purpose of this example is to provide an illustration of these mechanisms in order to facilitate better discussion.
Note that this illustration represents a worst-case scenario (thereby exercising most of the functionality provided by this specification), and does not represent a typical scenario.

a. First, a PTR resource record for 14.2.0.192.in-addr.arpa. is added to the internationalized zone database on the replication master server for the 2.0.192.in-addr.arpa. zone, with the resource record data value of "host..example.com." (where is an internationalized domain name compliant with the host naming rules provided in this document).

   Both of these domain names have a primary representation consisting of UCS characters in some local encoding, but are also available as UTF-8 and ACE encoded data so they can be encapsulated within DNS queries and responses. Once the zone is reloaded and is replicated by the other authoritative servers for that zone, the domain names can be processed.

b. An application on a remote system generates a DNS lookup for the PTR resource record associated with the 14.2.0.192.in-addr.arpa. domain name.

   If this is a legacy application, it issues the lookup using the only method it knows, which is to pass the domain name to the legacy resolver API. This would result in the resolver issuing a legacy DNS query for the PTR resource record associated with the specified domain name.

   If this application is compliant with this specification, it performs the following steps:

   1. Verify that the resolver is capable of processing queries for UTF-8 domain names by probing for an internationalized API. If this step failed, then the domain name would be converted to the legacy STD13 octet encoding in step 3.6.b.3 and passed to the resolver's legacy API.

   2. Convert the domain name from its generated encoding to the canonical UCS characters, and then normalize and case-convert the UCS characters.

   3.
      Convert the normalized and lowercased UCS characters to the charset or encoding used by the resolver's internationalized API.

   4. Issue a lookup for the PTR resource record associated with the internationalized domain name, via the resolver's internationalized API.

   Note that even though the domain name is compatible with the legacy host name rules, the domain name is passed through the internationalized API so that servers can tell whether or not the original application is UTF-8 compliant, and can determine the format of any internationalized domain names which are to be returned in the response messages. This is required in case the queried resource record includes internationalized domain names as resource record data (as would be the case with PTR resource records), and is also required for the proper handling of any SOA or NS resource records which may be returned as additional data in the response.

   For the purpose of this example, we will assume that each of these steps was successfully performed.

c. The client's stub resolver generates the query, with the Question Section of the query containing the UTF-8 encoded domain name encapsulated in an EDNS/UTF-8 extended label.

d. The stub resolver sends the query to one of its configured resolving servers.

e. The resolving server will either answer the query from its cache or forward the query to a name server which is authoritative for the namespace hierarchy, as per the normal query-resolution procedure. For the purpose of this example, we will assume that the server has no information about the specified domain name, so it forwards the query to one of the root zone's authoritative servers in order to begin the iterative resolution process.

f. The queried server responds with a referral, providing delegation data for a zone in the path to the queried domain name.
      For the purposes of this example, we will use 192.in-addr.arpa. as the delegation domain specified in the referral message. The specific format of the referral will depend on whether or not the queried server understands the EDNS/UTF-8 label encoding. If the server is compliant with this specification (which it is, or else it wouldn't have answered with a referral), then the referral will also provide EDNS/UTF-8 encoded domain names in the Authority and Additional-Data Sections of the referral. If the server was not compliant with this specification, it would return an error upon seeing the extended label type, which would cause the resolving server to restart the query using the legacy label type.

   g. The resolving server decodes the UTF-8 encoded domain names to their UCS character representation, caches the resource records in their UCS form, and sends the query to one of the authoritative servers for the referral zone. Note that the cache did not normalize or case-convert the UCS characters; only the end-systems perform this work.

   h. In this case, the queried server does not understand the EDNS/UTF-8 label format, and has returned a FORMERR response code.

   i. When these errors are encountered, the current resolver (whether this is the client's stub resolver or a caching server in the query path) must convert the query domain name from its current form to a legacy-compatible encoding (either ACE or STD13 octet sequences, depending on the UCS characters which have been encoded), and then has to reissue the query in that format. In this case, the domain name only contains printable characters from US-ASCII, so the STD13 octet encoding is used for the fall-back query.
      Because the UCS domain name was normalized and lowercased before it was passed to the client's stub resolver, the legacy domain name will also be in this format (although it will be compared in a case-neutral form by the recipient server). Note that once this conversion takes place, the legacy label format is used for the remainder of the current query chain (this prevents excessive delays from multiple fall-back operations, which could result in timeouts at the original resolver or application).

   j. The queried server returns a delegation referral for the 2.0.192.in-addr.arpa. zone. Since the query arrived in the STD13 octet encoding, the server has no indicator of the client's capabilities, so the referral NS resource records will also be returned in legacy compatible form (either as STD13 octet sequences or as ACE encoded data, depending on the character codes provided in each label from each of the associated domain names).

      Note that even though these NS resource records will be restricted to legacy-compatible host names and label types, they may contain and reference ACE domain names. In this regard, a legacy server in the delegation path does not prevent internationalized domain names from being delegated or resolved, but only prevents them from being processed as EDNS/UTF-8 extended labels.

      Also note that once the authoritative servers for a zone have been discovered and cached, any subsequent UTF-8 queries which are generated for the resources in that zone will be sent directly to one of those servers, bypassing the delegation hierarchy. As such, subsequent queries which are provided in EDNS/UTF-8 labels can be processed directly by the zone's authoritative servers, without the delegation servers disrupting the process.

   k.
      The resolving server decodes the STD13 octet sequences and ACE encoded domain names to their UCS character representations, caches the resource records, and resends the query to one of the authoritative servers for the referral zone.

   l. The queried server processes the request. Since this query arrived as an STD13 octet sequence, the server must compare the seven-bit characters from the domain name (which is all of them, in this example) in a case-neutral form. Note that if the query had arrived as ACE or UTF-8 encoded domain names, the server would have decoded the specified domain name to its canonical UCS characters and performed a case-exact match against the resulting characters.

   m. The queried server responds with the requested data. Note that the query was submitted in the legacy label form due to the fall-back processing which occurred in step 3.6.i, so the server will only respond to this query with STD13 octet sequences or ACE encoded domain names, using the STD13 legacy label.

   n. The resolving server decodes the STD13 octet sequences and ACE encoded domain names to their UCS character representations, and caches the resource records. Since the query was originally received as an internationalized domain name (as indicated by the EDNS/UTF-8 extended label from the original query), the resolving server has to encode the answer data as UTF-8 before passing it back to the client's stub resolver. However, since the input was not provided in an encoded UCS form, the server has to normalize and case-convert the STD13 octet sequence in order to provide a valid internationalized domain name.

   o. The stub resolver decodes the UTF-8 encoded domain names which have been provided in the response message to their UCS character representation, and passes the data to the original calling application using the charset or encoding favored by the resolver.

   p.
      The application validates the received domain name by decoding the internationalized domain name to its canonical UCS characters, normalizing and down-casing the resulting domain name, and comparing the results with the answer data which was provided by the resolver.

As can be seen, the UTF-8 name resolution process is identical to the current resolution process, with the addition of a single fall-back query in step 3.6.i which resulted in one extra query/response pair (roughly equivalent to adding one extra delegation referral into the query path), and with several different encoding conversions, as required by the participating systems and services.

This example also illustrates the requirements which are placed on developers, zone administrators, and network operators in order for typical connection identifier services to function with UTF-8 domain names.

However, if each system and service had used UTF-8 for encoding purposes (including everything between the stub resolver's APIs and the authoritative servers for the target zone), then no additional queries or conversions would have been required (other than the direct UCS conversions required for validation and caching, the latter of which can be performed separately without affecting the processing path). In this regard, the example above illustrates how this system can function even when only a portion of the participating systems utilize UTF-8, and also illustrates how effective the entire operation would be if all of the recommendations and requirements provided in this specification were adopted.

It is also important to reiterate here that any such costs associated with this compliance are entirely elective by the affected parties. If they want to streamline the process, the option is available to them, although the system also works when very few optimizations are implemented.

4.
The Internationalized Namespace

In simple terms, this specification defines an internationalized namespace which consists of domain names and labels that contain UCS character codes, and also specifies a series of encoding formats which may be used whenever the UCS values need to be encapsulated for transmission within DNS messages or application data streams. In this regard, the internationalized namespace is the UCS representation of the domain names and labels as they are used for comparison operations once a domain name arrives for processing, while the transfer encodings ensure that a domain name arrives at the destination system intact, so that it may be processed in its canonical form.

There are four conceptual elements to this model:

   * Character codes. Labels from internationalized domain names have a single logical canonical representation as sequences of UCS code point values. The UCS characters are used when a particular label from a domain name is created by an application, stored in a zone, hosts or cache database, and is used whenever two sets of domain names or labels need to be compared. However, different kinds of domain names have different rules which govern the character codes that may be used.

   * Storage encodings. Whenever a domain name is created or copied from the network, it must be stored in a format that is reversible to the canonical UCS character representation of that domain name. This specification does not mandate or require any particular storage encoding, and allows this decision to be made on a per-implementation basis, as long as the storage encoding supports character codes which can be converted to UCS equivalent values for comparison purposes. However, the use of UTF-8 for this purpose is encouraged, since it is the most common.

   * Transfer encodings.
     Whenever a domain name needs to be sent over the network, it must be packaged in a form which is compliant with the capabilities of the transfer protocol in use. This document specifies three transfer encodings which may be used to encode canonical UCS character codes in DNS messages or application streams, which are: the octet encoding from STD13, the ACE encoding from , and the UTF-8 encoding from RFC2279. Each encoding has different costs and benefits in different usage scenarios.

   * Comparison operations. When two domain names need to be compared, they also follow rules which are appropriate to the type of domain name being provided, and the transfer encoding which may have been used to provide the domain name to the system.

This document defines four distinct types of internationalized domain names which may exist in the internationalized namespace, and also describes how each of the above considerations affects those domain names and their labels. These domain name types are described throughout the remainder of this section.

4.1. Internationalized Domain Names and Labels

This section describes the master template rules for all domain names and labels which may be used in the internationalized namespace, although subordinate rules and restrictions are also applied as secondary filters, depending on the intended usage of the domain name.

For example, domain names and labels which are to be used as internationalized host identifiers (either as host names, or as domain names which are used to specify a host) are restricted to a specific subset of UCS characters. Meanwhile, domain names and labels which are compliant with STD13's global rules are restricted to eight-bit code values, while the domain names and labels which are used as STD13 host identifiers are restricted to a specific subset of US-ASCII.
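The layering described above can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name classify_label is hypothetical, the STD13 host subset is taken to be the US-ASCII letters, digits and hyphen (with no hyphen in the first or last position) as detailed in section 4.4, and the internationalized host identifier category is not tested at all, since it depends on the character restrictions of [nameprep].

```python
import string

# Assumed STD13 host identifier alphabet: US-ASCII letters, digits, hyphen.
HOST_CHARS = set(string.ascii_letters + string.digits + "-")


def classify_label(label: str) -> str:
    """Return the most restrictive category that a non-empty label satisfies.

    Internationalized host identifiers are not distinguished here, because
    that test requires the [nameprep] character tables.
    """
    if not label:
        raise ValueError("the zero-length root label is not classified")
    if (all(ch in HOST_CHARS for ch in label)
            and not label.startswith("-")
            and not label.endswith("-")):
        return "STD13 host identifier"
    if all(ord(ch) <= 0xFF for ch in label):
        return "STD13 domain name"
    return "internationalized domain name"


if __name__ == "__main__":
    for sample in ["www", "_sip", "caf\u00e9", "\u4f8b\u3048"]:
        print(repr(sample), "->", classify_label(sample))
```

Each category in the sketch is a subset of the one that follows it, which mirrors the way the subordinate rules act as secondary filters on the master rules.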
The following diagram illustrates how the subordinate rules are applied and interpreted against the master restrictions:

                  +-----------------------+
                  | Internationalized DNs |
                  +-----------------------+
                   any UCS character codes
                      /          |
                     /           |
                    /            |
                   /             |
      +-----------+        +-----------+     +------------+
      | Int. Host |        | STD13 DNs +-----+ STD13 Host |
      +-----------+        +-----------+     +------------+
       normalized           character         ASCII letters,
       subset of            codes 0x00        numbers, and
       UCS chars            through 0xFF      hyphen char

As can be seen, the internationalized domain names and labels rules allow any UCS character code to be stored, although each particular usage of the domain names and labels will have its own secondary rules and restrictions.

In order to allow future documents to define additional rules as required for their usage, this document defines very few global rules on the core internationalized domain names and labels.

4.1.1. IDN syntax and structure

In this specification, an internationalized domain name consists of a variable number of labels, each of which contains a variable number of UCS character codes, not all of which will have defined UCS character interpretations.

Furthermore, the encoding system which is used to store and interpret those values on a system is not relevant to this specification, and is therefore not defined. The characters in a label can be stored in memory or on disk as UTF-8, UCS-4, ACE, or any other storage encoding which is desired by the operators and implementers of the affected system, as long as that encoding system is reversible to the canonical UCS character code values, and is able to represent the necessary range of UCS characters (the "necessary range" varies by operation).
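The reversibility requirement above can be checked mechanically. The following is a minimal sketch, assuming Python's built-in codecs as stand-ins for candidate storage encodings ("utf-32-be" is used here as an approximation of UCS-4); the helper name is_reversible is hypothetical.

```python
def is_reversible(label: str, codec: str) -> bool:
    """Return True if the codec can store the label and restore it exactly."""
    try:
        return label.encode(codec).decode(codec) == label
    except UnicodeError:
        # The codec cannot represent one or more of the UCS characters.
        return False


if __name__ == "__main__":
    label = "\u4f8b\u3048"  # two characters outside the eight-bit range
    for codec in ("utf-8", "utf-32-be", "latin-1"):
        print(codec, is_reversible(label, codec))
```

A codec such as Latin-1 fails this check for characters above U+00FF, which is why the "necessary range" matters when a storage encoding is selected.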
The only universal restrictions which apply to internationalized domain names and labels are those which govern length. This specification requires that labels from internationalized domain names MUST be restricted to a minimum length of two characters and a maximum length of 63 characters, inclusive. The exception to this rule is the root domain, which is always represented by a zero-length label. Note that this rule specifically refers to the canonical UCS characters, rather than any encoded form (encoding will often result in labels and domain names with fewer actual characters, due to overhead from the encoding algorithm).

A fully-qualified internationalized domain name is formed by joining a series of labels together, with the most-contextually specific label in the left-most position of the label sequence, and with the root domain occupying the right-most position.

The sum total of all labels in an internationalized domain name MUST NOT exceed 255 characters, inclusive. Any number of labels MAY be stored in the domain name, but the sum total of their lengths MUST NOT exceed this limit.

However, labels which contain UCS character codes greater than U+007F will result in multi-byte UTF-8 and ACE encodings, so the maximum length of a label or an internationalized domain name is governed by their UTF-8 and ACE encoded lengths. Both encodings MUST result in an encoded length of 63 octets or less in order to be usable, with a maximum cumulative length of 255 octets.

4.1.2. IDN transfer encodings

The UCS currently occupies a 21-bit range of character code values, containing tens of thousands of assigned characters, and hundreds of thousands of unassigned characters.
Due to the multi-byte nature of the code point values, UCS characters cannot be passed as protocol or application data in most of the existing Internet protocols (including DNS messages), at least not without the help of some kind of encoding scheme. At the very least, the UCS character values have to be encoded as eight-bit sequences if they are to fit within existing eight-bit data structures, and have to be encoded as a subset of US-ASCII characters if they are to be usable with legacy protocols and applications which only use STD13's host identifier rules for their structured domain name data types.

With this objective in mind, this document defines three different transfer encoding systems which can be used to convert internationalized domain names and labels into a form which is suitable for transfer in different data streams. These are the legacy STD13 octet encoding, ACE, and UTF-8. Each of these encoding schemes provides different benefits and capabilities to the internationalized DNS effort.

   * STD13 octets. The STD13 octet encoding scheme provides a direct one-to-one mapping between eight-bit characters and their eight-bit values, but it is only capable of storing character codes in the range of U+0000 through U+00FF, which severely restricts its usefulness.

   * ACE. The ACE encoding scheme is capable of storing UCS character code values as seven-bit sequences in STD13 legacy labels. While this makes it practically compatible with the legacy host identifier rules, the resulting data imposes additional labor on the Internet community, and the reuse of the legacy label also results in certain amounts of ambiguity with some DNS domain names and labels.

   * UTF-8.
     The UTF-8 encoding scheme is capable of encoding all UCS character code values as sequences of eight-bit data which are compatible with legacy DNS message restrictions, but the encoded output requires explicit support from internationalized applications and protocols. UTF-8 output uses a new label type in order to prevent additional ambiguity problems from arising.

The table below illustrates the UCS character code sequences which are supported by each of the different encoding schemes.

                      STD13
                      Octets     ACE     UTF-8
                     +-------+-------+--------
                     |       |       |
      US-ASCII       |   Y   |       |   Y
                     |       |       |
      Eight-Bit      |   Y   |   Y   |   Y
                     |       |       |
      Any UCS Chars  |       |   Y   |   Y
                     |       |       |

More specifically, the character code sequence ranges and their valid encodings are:

   * US-ASCII. If a label only contains character codes from the range of U+0000 through U+007F, then it MAY be encoded as a legacy STD13 octet sequence or UTF-8, but MUST NOT be encoded as ACE.

     Note that this specification explicitly prohibits seven-bit labels from being encoded as ACE data, since such an action would be redundant, would result in greater processing overhead for those labels, and would create multiple representations which introduce problems with caches on legacy systems.

     Furthermore, certain security risks would be introduced if this were allowed. For example, a malicious user could register or purposefully create an ACE encoded representation of the "example.com" label sequence such that users mistakenly send sensitive data to malicious systems.

     In order to prevent these problems from occurring, this specification requires that any ACE-encoded label which consists entirely of seven-bit characters MUST be immediately discarded with extreme prejudice. This rule applies to every implementation of this specification, including any applications, resolvers, caches or servers which process labels.

   * Eight-bit codes.
     If a label contains character codes from the eight-bit range of U+0000 through U+00FF, then it MAY be encoded as STD13 octet sequences, ACE, or UTF-8. This rule specifically requires that the label MUST contain at least one character from the eight-bit range, MAY contain any number of characters from the seven-bit range, but MUST NOT contain characters with code values which are greater than U+00FF.

     Since the STD13 octet encoding and ACE both use the legacy STD13 label type, this specification relies on the input encoding of a domain name in order to determine the output encoding. In some cases, however, the input encoding will not be clear, or will not be specified, and this can result in some ambiguity with label sequences from this range.

     For example, if the domain name provided in a query consists of seven-bit labels, then the STD13 octet sequence is the only valid encoding for the legacy STD13 label, meaning that ACE could not have been used in the query. If the specified domain name exists as a CNAME resource record which refers to a domain name that contains eight-bit character codes, then the proper output encoding for that domain name will not be clearly discernible. Moreover, the STD13 and ACE encodings will generate different results, since the STD13 octet sequence will only contain a single octet for the eight-bit character, while the ACE encoding will contain multiple octets of encoded data.

     When this situation arises, systems MUST give preference to the ACE encoding, on the assumption that the referenced character is more likely to represent a UCS character than an eight-bit code value (the UCS characters in this range are Latin-1, which are the most common characters after the legacy US-ASCII set).
     Furthermore, the ACE encoded representation of these characters allows for a broader range of subsequent operations (since it complies with the legacy host naming restrictions, it can be used with CNAME resource records that refer to hosts), while the STD13 octet encoded representation does not.

     It is possible to avoid this scenario on authoritative zone servers (and thus the affected caches) by allowing the operator to specify whether the input is Latin-1 UCS character data or binary data, with the server generating the proper output accordingly. Also note that the default encoding specified by this document is UTF-8, which does not suffer from the ambiguity problems described above.

   * Any UCS character codes. If a label consists of any character codes greater than U+00FF, then it MAY be encoded as ACE or UTF-8, but MUST NOT be encoded as STD13 octet sequences. STD13 is not capable of representing character codes greater than U+00FF, so it cannot be used with any UCS characters beyond the eight-bit range.

Encodings are performed on a per-label basis. Each label MUST NOT be encoded more than once. Also note that recursive encodings result in applications discarding the domain name.

When the STD13 octet encoding is used to encode labels for transmission, the labels are encoded according to the rules specified in STD13, and are encapsulated in STD13 legacy labels.

When ACE is used to encode labels for transmission, the labels are encoded according to the rules specified in , and are encapsulated in STD13 legacy labels (this process is described in section 5.2).

When UTF-8 is used to encode labels for transmission, the labels are encoded according to the rules specified in RFC2279, and are encapsulated in EDNS/UTF-8 extended labels (the format of this label is described in section 5.1).
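The per-range rules above can be summarized as a lookup keyed on the highest character code in a label. This is a sketch only; the function name allowed_encodings and the string tags it returns are hypothetical, and the zero-length root label is deliberately rejected rather than handled.

```python
def allowed_encodings(label: str) -> set:
    """Return the transfer encodings permitted for a non-empty label."""
    if not label:
        raise ValueError("the zero-length root label is not handled here")
    highest = max(ord(ch) for ch in label)
    if highest <= 0x7F:
        # US-ASCII only: ACE is explicitly prohibited for seven-bit labels.
        return {"STD13", "UTF-8"}
    if highest <= 0xFF:
        # At least one eight-bit code, and nothing above U+00FF.
        return {"STD13", "ACE", "UTF-8"}
    # Any code above U+00FF cannot be carried as STD13 octets.
    return {"ACE", "UTF-8"}


if __name__ == "__main__":
    for sample in ["example", "caf\u00e9", "\u4f8b\u3048"]:
        print(repr(sample), sorted(allowed_encodings(sample)))
```

Where the eight-bit case leaves a choice between STD13 octets and ACE for an STD13 legacy label, the text above gives preference to ACE; the sketch reports what is permitted and leaves that selection to the caller.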
Note that a domain name MAY contain any combination of STD13 octet encoded labels and ACE encoded labels. However, if a domain name contains any UTF-8 encoded labels, then ALL of the labels from that domain name MUST be encoded as UTF-8 data. This rule primarily exists so that DNS compression services can be maintained consistently, but it also prevents mixed referrals which can trigger unnecessary fall-back processing, and also provides a single encoding representation to internationalized systems which benefits efficiency.

The root domain (as specified by the zero-length label at the right edge of the domain name) MUST NOT be encoded with ACE. More specifically, zero-length labels MUST NOT contain any character data of any kind, and since ACE labels have prefix strings, they are explicitly forbidden from being used for the root domain.

4.1.3. IDN comparison operations

When an internationalized domain name label is received from the network as ACE or UTF-8 encoded data, the labels MUST be decoded to their canonical UCS character representation, and the resulting UCS characters MUST be compared as case-exact sequences to their stored equivalents. Except where specifically required in this specification (e.g., validity tests which are performed by applications), normalization and case-conversion MUST NOT be performed against the resulting UCS character codes prior to any comparison operations being performed.

However, internationalized domain name labels which are received as STD13 octet sequences MUST be given special treatment, as these domain names could have originated from legacy systems operating under STD13's rules. In this case, the seven-bit US-ASCII alphabetic characters (U+0041 through U+005A, and U+0061 through U+007A) from those labels MUST be compared in a case-neutral form.
All other code values MUST be compared as case-exact code values (this particularly includes eight-bit characters, which were not defined by STD13).

4.2. Internationalized Host Identifiers

Internationalized host identifiers are a subset of the internationalized domain names described in section 4.1, which only use a subset of the allowable UCS characters, but which reuse the global transfer encodings and comparison routines.

Most of the displayable characters from the UCS can be used in host identifiers, and there are no additional rules governing the ordering or length of their labels. However, the characters which are used in internationalized host identifiers MUST be normalized and case-converted before they are encoded for storage or transfer. This requires more effort on the part of applications and servers when the internationalized domain names are initially created, but results in less ambiguity and lower processing requirements for servers, caches and resolvers during subsequent comparison operations.

The restrictions which govern the creation of internationalized host identifiers are as follows:

   a. Labels MUST be restricted to the subset of characters which are permitted by [nameprep]. Characters which are prohibited by MUST NOT appear in any label of any internationalized host identifier.

   b. Labels MUST be normalized through before they are stored or encoded for transfer. Internationalized host identifiers will not be normalized as part of any comparison operation, so systems MUST normalize the labels before they are stored or transmitted.

   c. Labels MUST be converted to lowercase according to the case-mappings rules specified in before they are stored or encoded for transfer. Internationalized host identifiers will not be converted to lowercase as part of any comparison operation, so systems MUST convert the labels to lowercase before they are stored or transmitted.
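Rules b and c above can be approximated with the Python standard library. This is a sketch under explicit assumptions: Unicode NFKC normalization and str.lower() stand in for the normalization and case-mapping tables which this document takes from its references, the prohibited-character check of rule a is omitted, and the function name prepare_host_label is hypothetical.

```python
import unicodedata


def prepare_host_label(label: str) -> str:
    """Normalize, then lowercase, a label before it is stored or sent.

    NFKC and str.lower() are assumed stand-ins for the referenced
    normalization and case-mapping rules; rule a is not enforced.
    """
    normalized = unicodedata.normalize("NFKC", label)
    return normalized.lower()


if __name__ == "__main__":
    raw = "A\u0301B"  # U+0041 U+0301 U+0042
    prepared = prepare_host_label(raw)
    print(["U+%04X" % ord(ch) for ch in prepared])  # ['U+00E1', 'U+0062']
```

Because the work is done once at creation time, a later comparison can treat the stored label as an exact sequence of code points.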
According to the rules above, a label from an internationalized host identifier which was originally created with the UCS character sequence of (U+0041 U+0301 U+0042) would be normalized and lowercased to (U+00E1 U+0062). The normalized, lowercase form would be used as the canonical UCS character representation of that label when it was encoded for storage and transmission purposes, and would be the form which was used for comparison operations on any resolvers, caches and servers.

Internationalized host identifiers which are received from the network can contain labels which have been encoded as STD13 octet sequences, ACE or UTF-8. In all of these cases, the comparison rules defined in section 4.1.3 MUST be applied.

4.3. STD13 Domain Names

STD13 allows any eight-bit code values to be used in domain name labels. However, STD13 host identifiers (as described in section 4.4 of this specification) are the most common form of STD13 domain names, and have much tighter restrictions.

There are common uses of STD13 domain names which do not comply with the STD13 host identifier subset, however. One common example of this is SRV identifiers, which use an underscore character (U+005F) as part of their label syntax.

Another common example is found when email addresses are provided in SOA and RP resource records, and where the left-hand side of the email address is stored as an STD13 domain name label which does not represent a host identifier. Furthermore, email addresses often contain extra characters which are not legal in STD13 host identifiers, such as a full-stop character (U+002E). For example, "joe.admin" could be stored as an STD13 domain name label in the fully-qualified domain name of "joe.admin.example.com.", which would represent the email address of "joe.admin@example.com" when that domain name was extracted from the SOA or RP resource record and processed.
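The mailbox example above can be sketched as follows. The helper names rname_to_email and to_master_text are hypothetical, the input is assumed to be a list of already-decoded labels without the root label, and the backslash quoting follows the "\X" convention of the STD13 master file format.

```python
def rname_to_email(labels):
    """Treat the first label as the mailbox and the rest as the mail domain."""
    local, *domain = labels
    return local + "@" + ".".join(domain)


def to_master_text(labels):
    """Render labels as fully-qualified text, quoting '.' and '\\' in a label."""
    escaped = [
        label.replace("\\", "\\\\").replace(".", "\\.") for label in labels
    ]
    return ".".join(escaped) + "."


if __name__ == "__main__":
    labels = ["joe.admin", "example", "com"]
    print(rname_to_email(labels))  # joe.admin@example.com
    print(to_master_text(labels))  # joe\.admin.example.com.
```

The quoted form keeps the full-stop inside the first label distinct from the label separators, which an unquoted textual rendering cannot do.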
Implementations of this specification MUST allow STD13 domain names to be created and stored, using the following rules:

   a. Labels MUST be restricted to the code values of U+0000 through U+00FF. Restrictions on character content MUST NOT be applied (note that if this domain name will be used as part of an STD13 host identifier, the rules specified in section 4.4 MUST be used instead).

   b. Labels MUST NOT be normalized or lowercased before they are stored or encoded for transfer.

   c. Systems MUST allow STD13 domain names to be specified as exact sequences of eight-bit octet values, and MUST NOT treat these sequences as canonical UCS characters which are normalized or lowercased. STD13 defines an escaping mechanism whereby the decimal value of the octet is prefaced with a reverse-solidus (such as "\193"), which is suggested for this usage.

STD13 domain names which are received from the network can contain labels which have been encoded as STD13 octet sequences, ACE or UTF-8. In all of these cases, the comparison rules defined in section 4.1.3 MUST be applied. Note that some of these sequences can contain octet code values which have not been normalized or lowercased by the originating system, since these values can be used to specify binary domain names.

4.4. STD13 Host Identifiers

This document does not deprecate, replace or modify the host name rules defined by RFC952, STD3 or STD13 as they apply to legacy host identifiers. However, there are several issues which affect the usage of these domain names and their labels in this system.

The range of characters which are currently defined as valid in STD13 host identifiers are the uppercase and lowercase letters, numbers and hyphen character from US-ASCII. No other characters are allowed to be used.
Furthermore, the current rules also prohibit the use of the hyphen character in the first or last character position of a host identifier label.

Implementations of this specification MUST allow STD13 host identifiers to be created and stored, using the following rules:

   a. Labels MUST be restricted to the code values of U+002D, U+0030 through U+0039, U+0041 through U+005A, and U+0061 through U+007A.

   b. Labels MUST NOT contain the code value of U+002D in either the first or last character position of the label.

   c. The alphabetic characters MUST be converted to lowercase before they are stored or transmitted. STD13 host identifiers are always compared in a case-neutral form.

STD13 host identifiers which are received from the network can contain labels which have been encoded as STD13 octet sequences or UTF-8. In both cases, the comparison rules defined in section 4.1.3 MUST be applied.

5. Transfer Encodings and Label Types

As was discussed in section 4.1.2, internationalized domain names and labels are required to be encoded as either eight-bit or seven-bit data whenever they are transmitted as protocol or application data.

The particular output encoding format which will be used for any given label will be primarily determined by the capabilities of the participating end-point systems. If the application or protocol which is relaying the domain name labels supports internationalized domain names directly then UTF-8 encoded labels can be used, but if the protocol or application is only capable of supporting STD13 host identifiers as domain name data, then the STD13 octet and/or ACE encoded labels will have to be used. With DNS messages in particular, the "data type" is the label encapsulation in use.
Although STD13 legacy labels allow for the use of eight-bit codes, multiple encodings for the same basic character data result in interpretation problems without some form of ancillary tagging service. For this reason, each encoding is represented differently by this specification. When the STD13 legacy label contains STD13 octet sequences then no tagging is provided, but if the STD13 legacy label contains ACE encoded data then the encoded sequence is tagged with an ACE identifier (a character prefix which does not normally appear in labels). When UTF-8 domain names are provided, an EDNS/UTF-8 extended label is used to encapsulate the internationalized domain name.

Furthermore, the encoding which is used for any label in the message will also determine the label type which is used to encapsulate and transfer the entire domain name. If any label from a domain name uses the EDNS/UTF-8 extended label type, then all of the labels from that domain name are required to be encapsulated for transfer in EDNS/UTF-8 extended labels. Conversely, if a domain name contains ACE or STD13 octet encoded labels, then all of the labels from that domain name are required to be encapsulated for transfer using the STD13 legacy label format.

Note that other legacy applications and protocols will most likely be required to provide extended encodings or negotiation features before they can exchange internationalized domain names directly. However, new applications and protocols which are subsequently written to comply with BCP18 and this specification should not require any such effort, as they should be capable of transferring UTF-8 domain names from the beginning.

5.1. The EDNS/UTF-8 Label Type

Any internationalized domain name label which has been encoded as UTF-8 for transmission in a DNS message MUST be encapsulated as an EDNS/UTF-8 label.
The EDNS/UTF-8 extended label is an instance of EDNS extended label types (as defined by RFC2671). Extended labels are indicated by the leading bit pattern of 0b01 in the label type field (the first two bits from the "label length" octet of the STD13 legacy label type), with the remaining six bits of this octet indicating the extended label type in use. The EDNS/UTF-8 label type uses the binary value of 0b000011 for this indication (note that IANA may change this assignment).

EDNS/UTF-8 labels contain two subordinate units of data. The first octet contains a length indicator which works exactly the same as the length octet as used by STD13 legacy labels: if the first two bits of this octet are 0b00 then the rest of that octet provides the length of the label data field, but if the first two bits of this octet are 0b11 then the label is a pointer to some other label, and the remainder of the length octet provides an off-set which points to the length octet of the referenced label, as per the rules provided in section 4.1.4 of RFC 1035 (STD13, part 2).

The structure of the EDNS/UTF-8 extended label is illustrated by the following figure.

                        1 1 1 1 1 1 1 1 1 1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0 1|0 0 0 0 1 1|    length     |  label data ///   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   0b01        - The extended label identifier.

   0b000011    - The EDNS/UTF-8 extended label type identifier.

   Length      - The number of octets in the label data, or the off-set to the length octet of another EDNS/UTF-8 label.

   Label data  - The label data, encoded as UTF-8 octets.

The following example shows the domain name of "me.com", where the "e" in "me" is the UCS character LATIN SMALL LETTER E WITH ACUTE (U+00E9), which has the UTF-8 encoded octet sequence of 0xC3A9.

      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   20 | 0  1  0  0  0  0  1  1|         0x03          |
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   22 |       0x6D (m)        |       0xC3 (e')       |
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   24 |       0xA9 (e')       | 0  1  0  0  0  0  1  1|
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   26 |         0x03          |       0x63 (c)        |
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   28 |       0x6F (o)        |       0x6D (m)        |
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   30 | 0  1  0  0  0  0  1  1|         0x00          |
      +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Octet 20 identifies the EDNS/UTF-8 extended label type, while octet 21 indicates that the label is three octets long. Octet 22 contains the UTF-8 value for lowercase "m", while octets 23 and 24 contain the UTF-8 value for the UCS character U+00E9 (encoded as 0xC3A9).

Similarly, octet 25 identifies another EDNS/UTF-8 extended label type, octet 26 indicates that the label is three octets long, and octets 27 through 29 contain the UTF-8 values for the lowercase alphabetic sequence of "com".

Finally, octet 30 identifies another EDNS/UTF-8 extended label type, while octet 31 indicates that the label is zero octets in length, thereby signifying the root zone (the end of the queried domain name).

Note that the use of the EDNS/UTF-8 extended label type serves multiple purposes. On the one hand, it provides a method of signaling the resolver's capabilities to the server, so that the server can determine which format it needs to use when returning answers, referrals or errors.

Moreover, using an encapsulation format which is not backwards compatible prevents certain ambiguity problems which can result from overloading the STD13 legacy label with multiple encodings. These problems are seen in certain situations with STD13 octet encoding and ACE, where a server cannot adequately determine which encoding a resolver desires.
By using a separate extended label type for UTF-8, these kinds of ambiguities are avoided.

There are additional benefits which come from using EDNS extended label types, which are best expressed as "future possibilities". Once the EDNS extended label mechanisms are widely deployed, it becomes feasible to specify additional encoding mechanisms as soon as the Internet community deems it desirable. In this regard, defining alternative encodings is much easier the second time.

5.2. The STD13 Legacy Label Type

Any internationalized domain name label which has been encoded as ACE or STD13 octet sequences for transmission in a DNS message MUST be encapsulated within an STD13 legacy label.

This document does not deprecate, replace or extend the STD13 octet encoding or label encapsulation rules defined by STD13. However, this document does provide some guidance on the creation and interpretation of ACE encoded labels when they are stored in legacy labels, which is necessary in order for recipient systems to properly detect and decode the label contents.

Note that STD13 octet sequences and ACE data MAY both be provided in the same domain name. As such, each STD13 legacy label from a DNS message must be examined and processed independently.

5.2.1. ACE encoded labels

ACE encoded labels always begin with a fixed prefix character sequence (this document uses "zz--" as a placeholder sequence until a formal assignment is made). Any label which contains ACE encoded data MUST begin with this character sequence prefix. Similarly, any label which begins with this character sequence MUST be recognized and processed as an ACE encoded label, according to the rules defined in this specification.

Encoding and encapsulating a label as ACE data is a three-part process, as follows:

   a. Encode the canonical UCS character data from the internationalized domain name label into ACE using the procedure defined in

   b.
      Preface the encoded output with the "zz--" prefix sequence, thereby indicating that this label contains ACE encoded UCS character data.

   c. Determine the length of the encoded data and store this value in the STD13 legacy label's length octet.

Decoding an ACE label is the opposite of that process.

Note that whenever the ACE algorithm encounters a seven-bit character code in the input, it is passed through unmodified to the encoded output. If a label only contains seven-bit character codes, the label MUST NOT be encoded as ACE, and MUST be encoded as either STD13 octet sequences or UTF-8.

Forcing a seven-bit label to be encoded as ACE provides no benefit, incurs additional processing on the end-point systems, and can also expose certain security risks. Any system which is capable of generating and deciphering ACE encoded labels is required to treat such sequences as hostile, and MUST dispose of them immediately without any further processing; systems are forbidden to even return these labels in DNS error messages.

Similarly, ACE MUST NOT be used to encode any zero-length labels (including but not specifically limited to the root domain), since the presence of prefix characters in these labels can invalidate their protocol-specific interpretations.

When an STD13 legacy label is received which has "zz--" in the first four character positions, the label MUST be treated as an ACE-encoded internationalized domain name, and MUST be decoded to its canonical UCS character values for further processing.
Note that STD13 legacy labels MUST be verified before the ACE encoded data is extracted (as per the rules defined in STD13 which govern the STD13 legacy label type), but systems which are compliant with this specification MUST perform all subsequent comparison, caching, or storage operations against the canonical UCS characters, and MUST NOT use the ACE encoded label sequence for any of these operations.

Note that the legacy systems which are not compliant with this specification will treat ACE encoded labels as any other STD13 legacy label.

5.2.2. STD13 octet encoded labels

Any STD13 legacy labels which do not begin with the ACE prefix MUST be treated as STD13 octet encoding sequences. The rules for this process are defined by STD13's default label encapsulation services, although this document also provides some clarifications on the use of this encoding with internationalized domain names and labels.

Whenever the STD13 octet sequence is used to encode the labels from an internationalized domain name, the octet values of the canonical UCS characters are stored directly in the label. Because the DNS message is limited to octets, the range of UCS character codes which are eligible for use with STD13 octet sequences is limited to U+0000 through U+00FF. If any UCS character codes outside this range need to be transferred, the internationalized domain name label will have to be encoded as ACE or UTF-8.

Note that comparison operations for the seven-bit range of alphabetic character values MUST be performed in a case-neutral form, although eight-bit code values MUST NOT be normalized or case-converted as part of a comparison operation. These rules are required in order to ensure backwards compatibility with the STD13 compliant systems which may be generating these labels as parts of an STD13 domain name while also supporting the normalization and case-conversion which may have been applied to the UCS characters in the storage or transfer encoding systems.
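
The encapsulation rules of sections 5.1 and 5.2 can be sketched in a few lines of Python. This is only an illustrative sketch and not a reference implementation: the function names are hypothetical, the "zz--" prefix is the placeholder used by this document, the ACE algorithm itself is not shown, and label pointers (compression) are not handled.

```python
ACE_PREFIX = b"zz--"         # placeholder prefix used by this document
EDNS_UTF8_TYPE = 0b01000011  # 0b01 indicator followed by the 0b000011 type code


def encode_edns_utf8_name(labels):
    """Encapsulate each label, plus the root, as EDNS/UTF-8 extended labels."""
    out = bytearray()
    for label in list(labels) + [""]:  # "" is the zero-length root label
        data = label.encode("utf-8")
        if len(data) > 0x3F:  # only six bits are available for the length
            raise ValueError("label data too long for the length octet")
        out += bytes([EDNS_UTF8_TYPE, len(data)]) + data
    return bytes(out)


def encode_std13_octet_label(label):
    """Store code values U+0000 through U+00FF directly in a legacy label."""
    if any(ord(ch) > 0xFF for ch in label):
        raise ValueError("code values above U+00FF require ACE or UTF-8")
    data = bytes(ord(ch) for ch in label)
    if len(data) > 0x3F:
        raise ValueError("label data too long for the length octet")
    return bytes([len(data)]) + data


def classify_legacy_label(data):
    """Report how the data portion of an STD13 legacy label must be treated."""
    return "ace" if data.startswith(ACE_PREFIX) else "std13-octets"


def std13_octets_equal(a, b):
    """Case-neutral for A-Z and a-z only; eight-bit values are left as-is."""
    # bytes.lower() only converts ASCII letters, so octets above 0x7F
    # are compared without any case-conversion or normalization.
    return a.lower() == b.lower()
```

Under these assumptions, encode_edns_utf8_name(["m\u00e9", "com"]) produces the twelve octets shown at offsets 20 through 31 in the example of section 5.1.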

6. Application Guidelines

As was discussed in section 3.3, there are multiple scenarios in which an application can make use of internationalized domain names, ranging from simple lookups of connection identifiers to abstract encapsulations of unstructured application data. This is an extremely broad range of uses, which is complicated by the extreme pervasiveness of applications and protocols that use domain names for one or more of these purposes.

Furthermore, network applications face a complex array of input and output operations which will cumulatively affect the ability of that application to make use of the internationalized domain name system for various services and functions. These issues are illustrated by the figure below:

        [IDNs]              [IDNs]
           |                   ^
           |                   |
    +------V------+     +------+------+
    |    input    |     |   output    |
    |   charset   |     |   charset   |
    +-----------+-+     +-+-----------+
                 \       /
              +---+-----+---+
              | Application |
              +---+-----+---+
                 /       \
    +-----------+-+     +-+-----------+
    |   lookups   |     |  app data  <---> [IDNs]
    +------+------+     +-------------+
           |
    +------+------+
    |  resolver  <---> [IDNs]
    +-------------+

As can be seen, the ability of an application to completely adopt internationalized domain names will be determined by many factors, any one of which could prevent the application from completely incorporating the restrictions and recommendations prescribed by this specification.

In order to allow for a flexible adoption schedule, this specification defines very few mandates that applications must adopt, but instead focuses on recommendations which applications should comply with whenever they need to use internationalized domain names, and also provides recommendations for situations where the preferred behavior is not feasible.

Applications which are compliant with all of the recommendations provided in this specification will be able to generate, store, transfer and resolve internationalized domain names throughout all of their operations, using UTF-8 as a common encoding for all of these operations. Meanwhile, applications which are not in complete compliance with this specification will still be able to make use of the internationalized domain names in these operations, although such access may be limited to using backwards-compatible encodings which require greater amounts of effort to implement and which provide fewer benefits.

6.1. Input and Output Charsets

If an application is unable to accept, process, store or display characters from the complete UCS repertoire, that application's support for internationalized domain names will be somewhat limited, by definition.

Although this document does not mandate any particular charset or encoding which all applications must use for all operations, applications SHOULD use coded character sets or encodings which can handle characters from a reasonable number of scripts. In particular, the following areas have specific requirements:

   * Input charsets and encodings. Since UTF-8 is used as the default encoding for internationalized domain names throughout this specification (and others, such as BCP18), UTF-8 is also RECOMMENDED for use with input encodings of internationalized domain names in particular, although this is not required. Many platforms and development environments support UTF-8 as a local encoding of the UCS and it can be reasonably used with many types of input (such as configuration files), although many systems will require a specific encoding (such as UCS-2, or ISO/IEC 8859-1) in situations which require memory access or keyboard input.

     Regardless of the input encodings used, implementations MUST map domain names and labels to their canonical UCS characters for any normalization and case-conversion work which is subsequently required by any DNS lookups (see section 6.3).

   * Output choices will likely be limited to a system-preferred charset or encoding. In general, this document RECOMMENDS that output systems choose an output charset or encoding which reflects the data being provided. However, applications MUST NOT display unknown characters with generic replacement characters (such as boxes or circles) if it is known that the original characters are not available for display with the specified charset, as such characters will almost certainly trigger failure conditions in subsequent protocol operations.

In those situations where adequate input or output charsets or encodings are unavailable, applications MAY use ACE to encode internationalized domain names for the purpose of ensuring that the data is provided intact. Since ACE is capable of representing UCS characters as sequences of seven-bit characters, it is functionally usable as a last line of defense in almost any environment, with the caveat that ACE encoding sequences are extremely cryptic and will likely result in lower levels of usability and functionality.

6.2. Protocol and Application Data

There are several interrelated issues which will determine an application's ability to provide or accept internationalized domain names as protocol or application data, although the principal determining factors for any such usage will generally be the capabilities of the underlying protocol itself.

If a protocol allows negotiation or tagging services in order to distinguish between different encodings, that protocol can likely be extended to support the use of UTF-8 as protocol or application data through command/response negotiation options or through data-type tags. Older protocols which do not provide any negotiation services or which mandate the use of US-ASCII in all data will likely require the use of ACE encoded domain names as a short-term measure until the protocol is made compliant with BCP18.

   * Protocol data. If the protocol supports UTF-8 encoded internationalized domain names in commands or responses, then that encoding SHOULD be used wherever it is allowed. If UTF-8 is not supported by the protocol, STD13 octet sequences and/or ACE encoded equivalents of the internationalized domain name MUST be used.

     In some cases, this negotiation can be performed on a per-session basis; in other cases this work will need to be performed for each transaction within the session; and in still other cases the internationalized domain names will have to be tagged whenever they are provided as protocol or application data.

     The DNS protocol is itself an example of a protocol which requires tagging in order for internationalized domain names to be exchanged within the existing DNS message (with these indicators taking the form of ACE encoding prefixes and EDNS/UTF-8 extended label type codes). Meanwhile, a protocol such as WHOIS can theoretically support a session-wide negotiation option that allows the use of internationalized domain names as protocol and application data for the duration of that session.
     Conversely, a protocol such as SMTP will likely require the use of session-specific identifiers for some operations, while other operations may be able to use label tags (similar to the existing support for domain literals, which are identified by a pair of surrounding square brackets).

     Regardless of the encodings which are used, implementations MUST map domain names and labels to their canonical UCS characters for any normalization and case-conversion work which is subsequently required as part of a DNS lookup (see section 6.3).

   * Structured application data. Structured application data such as URLs and email addresses MUST be processed according to the rules which govern those data formats. Applications MUST NOT perform any conversion or transliteration which is not explicitly prescribed by the governing documents, since non-standard usages are likely to result in misinterpreted data.

   * Unstructured application data. Domain names which appear as unstructured data in application content are beyond the control of this specification, and are generally subject to the encoding and formatting desires of the end-users who created the data.

     Generally speaking, it is RECOMMENDED that applications allow users to enter or view documents in whatever format they prefer, but that any conversion between multiple source and destination charsets and encodings use UCS as the translation intermediary, such that internationalized domain names are properly converted along with the rest of the application data.

In some cases, the application will need to probe the resolver before it can use internationalized domain names as data.
For example, a participating system may need to determine the internationalized domain name of the local system so that it can provide this data in a protocol-specific banner message, and in these cases, the application will have to communicate with the resolver before this data can be provided.

Due to the usage-specific nature of internationalized domain names within protocol and application data streams, each development group will have to analyze the restrictions and capabilities which affect their specific services independently.

6.3. DNS Lookups and Resolver Calls

One of the most frequent uses for domain names is for lookup operations, such as for locating the IP addresses associated with a specified domain name, determining the domain name associated with a specified IP address, or performing a protocol-specific lookup operation for a specific resource record (such as the MX or SOA resource records associated with a specific domain).

Since these lookup operations do not directly affect external protocols or data, internationalized domain names can be used for lookup operations at the application's discretion. For example, applications such as ping and netstat only use domain names for display purposes, and can therefore make immediate use of internationalized domain names within their protocol operations.

Similarly, a protocol can be limited to STD13 host identifiers as protocol identifiers which will require the application to provide internationalized domain names as ACE encoded sequences, but any lookup operations which are necessary for the internationalized domain names can still be performed in their native form. In these cases, the protocol operations and lookup operations are separate tasks with separate rules.

Similarly, applications are not required to use internationalized domain names and internationalized resolver APIs for every lookup.
In some cases, it may be more efficient for an application to only use internationalized domain names for lookup operations against connection identifiers, and to use STD13 octet sequences or ACE encoded legacy lookups for domain names which were obtained as protocol or application data (this will be especially true in those cases where the protocol does not yet provide an internationalized domain name data-type).

In those cases where an application prefers to use the legacy resolution path, the application MUST use the resolver's legacy APIs. For lookups against internationalized domain names, the application MUST use the resolver's internationalized APIs.

Note that this specification does not define a mandatory encoding which must be used between the applications and the local resolver. However, resolvers MUST provide at least one encoding which is capable of supporting the entire UCS repertoire of character codes, including character codes which are currently unassigned. Since UTF-8 is the default encoding which is used throughout this specification, it is also RECOMMENDED for use with resolver APIs, although this is not required. Resolvers MAY dictate a local encoding, with the only requirement being support for the entire range of UCS character codes.

Regardless of the data being provided or the charset or encoding which is used to provide that data, applications MUST normalize and case-convert any internationalized host identifiers which they generate or receive from a lookup operation. This process MUST use the canonical UCS characters of the domain name according to the rules specified in for every host identifier which is sent to or received from a resolver.

If the application knows that the requested data specifically refers to a host identifier, then the domain name data which is returned by the resolver MUST be normalized and case-converted, and the resulting domain name MUST be compared to the original domain name which was received prior to the normalization and case-conversion steps. If the processed domain name does not match the domain name which was received, the domain name MUST be discarded as malformed.

This step is necessary in order to ensure the integrity and veracity of internationalized domain names which are processed by applications, since there are multiple opportunities for errors to be introduced (such as mistyped entries in the resolver's hosts database, or malicious data which has been purposefully provided in a zone), and these errors can result in sensitive data being directed to the wrong network.

Note that the above rule specifically applies to host identifiers and not to all internationalized domain names as a whole; applications MUST NOT arbitrarily normalize and case-convert any and all domain names, but MUST apply these steps to any and all domain names which are known to be used as host identifiers.

As part of the processing rules for DNS lookups, it is expected that an application can exchange internationalized domain names with the resolver using a charset or encoding which is capable of representing the entire UCS character code range. Towards this objective, applications SHOULD test the capabilities of the resolver prior to transferring internationalized domain names.

In those situations where the resolver is unable to support this usage, the application MUST encode the internationalized domain name as STD13 octet sequences or ACE, and pass the resulting STD13 host identifier to the resolver.

7.
Resolver Guidelines

Resolvers play a crucial role in the use of internationalized domain names, in that they provide the internationalized namespace which applications work with. As part of this service, resolvers provide encapsulation services for the internationalized domain names which are exchanged with the applications, resolve queries in the internationalized namespace on behalf of the applications, and provide lookup matching for entries which are stored in a local hosts database.

Note that resolvers which cache answer data for subsequent operations are also governed by the caching restrictions provided in section 9.

7.1. Resolver APIs

Stub resolvers which communicate directly with applications that are compliant with this specification are strongly encouraged to provide a separate set of APIs for those applications to use whenever internationalized domain names need to be provided in queries or response messages.

The use of an internationalized API will generally facilitate smoother operations for the applications, in that it will allow the application to determine the capabilities of the resolver, to obtain the internationalized domain name of the local system, and to process queries for internationalized domain names as special data types.

Furthermore, the use of internationalized versus legacy APIs provides a way for resolvers to separate internationalized and legacy application query paths, such that the legacy APIs only result in STD13 legacy labels, while the internationalized APIs generate and trigger EDNS/UTF-8 extended labels. The output formatting of the DNS messages is controlled by tight restrictions, and the use of alternative APIs will likely result in simpler resolver implementations.

For example, it is suggested that applications use the internationalized APIs for all of the DNS lookups they generate, even if the domain name only contains seven-bit characters. This is required in case the queried domain name only exists with a CNAME or PTR resource record which references an internationalized domain name, and the server has to know which encoding to use for that query. If the client had not used the internationalized API for the original lookup of the domain name, the resolver may have chosen the wrong label type, and thus the response data would only be returned as ACE encoded data.

Conversely, older applications which generate malformed eight-bit queries through the legacy APIs will result in those queries being properly rejected by the DNS servers, preventing undue problems with these applications from occurring. For example, an older application may process an internationalized domain name through the system-default charset or encoding (such as MacRoman), which would result in the domain name being malformed when the application tried to do something important with that domain name (such as send an email message over SMTP). The use of multiple APIs causes these malformed applications to break, and the invalid domain names are kept out of the application protocol space.

Internationalized APIs are optional to the extent that an application MAY use an embedded resolver which is known to be capable of generating and processing internationalized domain names through the existing function calls. However, the use of separate APIs for internationalized domain names is encouraged.

Although this document does not mandate any specific APIs, the following functions SHOULD be provided for in some form:

   * Test Wide. Applications MUST be able to test the resolver for compliance with this specification.
     In those cases where this function is performed by some other function (such as one of the following), the capabilities of the resolver MUST be detectable even if the requested operation fails. For example, if an application issues a call for the internationalized domain name of the local system, the capability of the resolver to handle internationalized domain names MUST be uniquely represented even if the local host name cannot be determined.

   * Get Wide X-By-Y. Applications SHOULD be able to specify any resource record associated with any internationalized domain name as part of a lookup operation. Whether this service is provided as a series of lookup-specific APIs or as a general purpose API is up to the resolver.

   * Get Wide Local Name. Applications which utilize internationalized domain names as data will need to be able to determine the internationalized form of their local system name for some operations (such as a protocol-specific welcome banner). When this function is called, the resulting data MUST be provided as the canonical UCS character code values, or their equivalent as represented by a locally mandated charset or encoding.

     Note that an ACE equivalent of the system name SHOULD be returned when the relevant legacy API is queried. In those cases where the legacy and internationalized domain names both contain seven-bit character codes (possibly because the host name is only available in US-ASCII, or because the host name was assigned as ACE by an external configuration service), the internationalized host name MUST still be accessible through the internationalized function.

Note that this specification does not specify a charset or encoding which must be used by the resolver APIs.
However, wherever an internationalized API is presented, the resolver MUST utilize a charset or encoding which supports the entire UCS repertoire of character codes, including character codes which are currently unassigned. Since UTF-8 is the default charset for most of the operations specified in this document, it is also RECOMMENDED for this service, but is not required.

7.2. Query Processing Services

Resolvers which are compliant with the recommendations provided in this specification will provide two query paths, one of which supports STD13 domain names and another which supports internationalized domain names. Technically, there is no requirement for two processing paths, although these paths will likely exist as conceptual paths even if they are not represented or implemented uniquely in all resolvers.

The legacy processing path is defined by STD13. This document does not update, modify or extend the rules that resolvers operate under when an STD13 compliant domain name is received from a legacy application through any legacy APIs which may exist.

However, when an internationalized domain name is received from an internationalized application through any internationalized APIs, the processing rules defined in this section MUST be followed. Note that these rules apply to all resolvers, whether they are stub resolvers, forwarders or caching servers.

Generally speaking, the internationalized domain name resolution process has two major components: processing internationalized domain names as queries, and performing fall-back processing if an EDNS/UTF-8 query is rejected by an authoritative server.

7.2.1. Internationalized queries

Queries for internationalized domain names which are received through internationalized APIs can be expected to have originated at an application which is capable of accepting and processing internationalized domain names in the response messages.
Resolvers MUST encode the labels from the queried domain name as UTF-8 and encapsulate the resulting encoded labels into EDNS/UTF-8 extended labels for transfer within DNS messages, per the instructions provided in section 5.1.

Any and all responses to these queries will also be encoded as UTF-8 and encapsulated in EDNS/UTF-8 extended labels. Resolvers MUST decode the provided response data, convert the labels to their canonical UCS character codes, and return the requested data to the calling application.

The resolver MUST NOT normalize or case-convert internationalized domain names which may be received in queries or response messages. Since the queries have originated from applications which have indicated that they are compliant with this specification (via the API) while the responses will have originated from caches or servers which indicate that they are also compliant (via the EDNS/UTF-8 extended labels), those systems are assumed to have normalized and case-converted the domain names before they were generated or stored. Also note that applications will validate the host identifiers that they receive in response messages, so an additional check is expected to be performed on the answer data by those systems.

7.2.2. Fall-back processing

If a queried server is unable to process EDNS/UTF-8 extended labels, then it is required by STD13 to generate an error signifying the problem. Resolvers MUST interpret these errors, decode the UTF-8 queried domain name, re-encode it as STD13 octets and/or ACE per the instructions provided in section 5.2, and then reissue the query as an STD13 legacy label sequence.

The legacy DNS error responses which will trigger this series of events are FORMERR and NOTIMPL. Any other errors indicate that the EDNS/UTF-8 extended label was successfully processed but that the query was not matched, and those errors MUST be returned to the application.
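
The fall-back trigger described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a resolver implementation: send_edns_utf8 and send_legacy are hypothetical callables which each issue one query and return an (rcode, answer) pair, and the numeric response codes are the RCODE values assigned by RFC 1035 (which names the second one "Not Implemented").

```python
FORMERR = 1  # RFC 1035 RCODE 1, format error
NOTIMPL = 4  # RFC 1035 RCODE 4, not implemented


def resolve_with_fallback(name, send_edns_utf8, send_legacy):
    """Try EDNS/UTF-8 first; reissue once in legacy form if it is rejected.

    Both send_* arguments are hypothetical callables supplied by the
    caller; each takes the domain name and returns (rcode, answer).
    """
    rcode, answer = send_edns_utf8(name)
    if rcode in (FORMERR, NOTIMPL):
        # The server could not process the extended label type, so the
        # same name is reissued as an STD13 legacy label sequence.
        rcode, answer = send_legacy(name)
    # Any other response code is handed back to the caller unchanged.
    return rcode, answer
```

In this sketch the legacy query is issued at most once, and whatever result it produces is returned to the caller as-is.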
If the fall-back processing results in any error responses whatsoever, then the resolver MUST return those errors to the calling application.

Any servers which subsequently receive the fall-back queries and which are compliant with this specification will process the queries as internationalized domain names, and will return the answer data as STD13 octet sequences or ACE encoded data, using the STD13 legacy label.

Generally speaking, fall-back processing serves two purposes:

* Answering the initial query. If a UTF-8 domain name cannot be resolved because a server in the delegation path does not understand the EDNS/UTF-8 label type, the resolver can reissue the query as an ACE encoded legacy label type so that the query proceeds past the problematic server.

* Seeding the resolver's cache. As a result of the above, the resolver will learn about the authoritative name servers for the target zone, and this information can be used for any subsequent queries for domain names within the specified zone (for as long as the data is cached, anyway). As such, any subsequent EDNS/UTF-8 queries which are issued for the portion of the namespace served by that zone will be sent directly to one of those authoritative servers where they can be answered directly. In this regard, subsequent lookups do not require fall-back processing if they are received during the cache window.

Regardless of whether or not fall-back processing has been performed, if the calling application issued the original query as an internationalized domain name, then the resolver MUST respond to the query in that form as well. This means that the resolver MUST convert any STD13 octet sequences or ACE encoded labels into their canonical UCS characters, convert the answer data into the resolver's native charset or encoding, and return the data to the calling process.
The resolver MUST NOT perform any normalization or case-conversion during this process, as such an action can corrupt domain names which are not used for host identifiers.

If the original query was received through the resolver's legacy APIs, then the query MUST be generated and returned in the legacy format, and MUST NOT be converted to an internationalized domain name prior to the query or response being passed through.

Once fall-back processing occurs, the process MUST NOT be repeated for any additional queries in the current lookup operation. Any other queries from the current lookup operation MUST NOT be sent as EDNS/UTF-8 extended labels, since multiple fall-back operations can result in time-outs on the client systems.

Because the fall-back process results in two lookups being issued against the rejecting zone, eliminating the fall-back processing as soon as possible will be an operational requirement for many organizations.

Any caches or forwarders which are used by stub resolvers within an end-user network are practically required to be able to process the EDNS/UTF-8 queries, since those servers will receive every query which is issued by the stub resolvers. While this isn't a technical requirement (fall-back processing will get around the problematic servers), it will likely prove to be a consideration for network operators looking to support internationalized domain names on their local networks.

This document also strongly encourages the root and TLD servers to be upgraded as soon as possible (even if they do not intend to directly provide UTF-8 domain name delegations), in order to allow those servers to read and process the EDNS/UTF-8 extended labels, thereby reducing the number of fall-back queries which are sent to those servers.

7.3. The Hosts Database

Generally speaking, there are two areas of consideration for stub resolvers that provide local hosts databases for name resolution services. These are the input requirements for internationalized domain names which will be added to the hosts database, and the requirements which govern how queries will be compared to the entries in the hosts database.

Note that resolvers are not required to implement a hosts database or local lookup services (STD3 says "a host MAY also implement a host name translation mechanism that searches a local Internet host table"). However, wherever a hosts database is provided with an internationalized resolver, compliance with the rules specified in this section is required.

If a stub resolver offers the capability to compare internationalized domain names against a local hosts database, that database MUST be compatible with the internationalized domain name rules specified in section 4 of this document. In particular, the resolver SHOULD allow internationalized domain names with any code values to be stored, even if the canonical UCS characters for those values are undefined or are illegal for use with internationalized host identifiers (this is required to support domain names which are not host identifiers).

In those cases where an internationalized domain name specifies an exact sequence of octets for binary comparison, the hosts database MUST provide a mechanism for tagging the eight-bit characters so that they are not interpreted, processed or compared as the canonical UCS character equivalents of those codes. However, entries which explicitly provide host identifiers MUST be normalized and case-converted prior to being stored.

In order to satisfy both of these requirements, it is RECOMMENDED that hosts databases store internationalized host identifiers as untagged data, but that they also provide some sort of tagging service for character code values which are to be returned as-is.
STD13 defines an escaping mechanism whereby the decimal value of the octet is prefaced with a reverse-solidus (such as "\193"), which is suggested for this usage.

The storage format of the hosts database MAY use any charset or encoding the resolver deems most suitable for that platform, as long as the rules and restrictions provided above are followed. Since UTF-8 is used as the default encoding throughout this specification, it is RECOMMENDED as the default encoding for hosts databases as well, although this is not required.

Not all of the applications which use a resolver are likely to be compliant with this specification, so resolvers MUST ensure that they are able to interpret and process any queries from the legacy APIs which provide the ACE equivalent of an internationalized domain name that is stored in the hosts database. When such a query arrives, the domain name MUST be converted to the canonical UCS character codes represented by the ACE encoded sequence and compared to entries in the hosts database in that form (tagged octets excluded). Any internationalized domain names which are required to be returned through the legacy APIs MUST be converted to STD13 octet sequences and/or ACE before they are returned.

8. Server Guidelines

When a zone administrator desires to provide internationalized domain names in a zone, they are presented with two options: they can add the STD13 octets or ACE encoded internationalized domain names to an existing zone, or they can use internationalized zone databases directly. Both of these usage scenarios have their own benefits and restrictions.

Using STD13 octet sequences and ACE with legacy servers allows for the immediate deployment of internationalized domain names on existing servers, and within hierarchies which include internationalized domain names.
However, any such queries which originate at applications that are compliant with this specification will always initially fail, guaranteeing that fall-back processing will always occur for those zones.

Conversely, using internationalized zones directly allows servers to process legacy, ACE and EDNS/UTF-8 queries equally, thereby providing greater value to the applications and resolvers which have been made compliant with this specification. However, internationalized zones have additional requirements (most notably, they are required to be upgraded simultaneously), and these will prove burdensome to some zone operators.

This specification focuses on the processing requirements for internationalized zones which support the use of internationalized domain names as explicit data, and which also support the necessary subordinate mechanisms such as EDNS/UTF-8 queries. When STD13 octet sequences or ACE encoded domain names are used with legacy servers, the rules defined in STD13 for those servers MUST be used.

Note that each zone SHOULD be configurable independently. If a server hosts multiple zones, each of those zones SHOULD be operable as an independent entity, with any of them using ACE or internationalized domain names as necessary. This rule is necessary since each zone is likely to have different replication partners and configuration rules which will require different migration strategies.

8.1. Internationalized Zones

All domain names which are published by an internationalized zone MUST be compatible with the restrictions specified in section 4 of this document. In particular, the zone database MUST allow binary domain names to be stored as any octet value, but MUST also comply with the normalization and case-mapping rules when a domain name represents a host identifier.
These restrictions MUST be applied as part of the process in which the domain name is being added to the zone database.

In those cases where an internationalized domain name specifies an exact sequence of octets for binary comparison, the zone database MUST provide a mechanism for tagging the eight-bit characters so that they are not interpreted, processed or compared as the canonical UCS character equivalents of those codes. STD13 defines an escaping mechanism whereby the decimal value of the octet is prefaced with a reverse-solidus (such as "\193"), which is suggested for this usage.

Servers which are compliant with this specification MUST be capable of providing UTF-8 and ACE encoded representations of the UCS domain names which are stored in the zone, and servers MUST restrict output to only one label type for any protocol operation, such that queries containing STD13 legacy labels MUST be answered with STD13 octet sequences and/or ACE encoded domain names, while EDNS/UTF-8 queries MUST only be answered with UTF-8 encoded domain names (this not only includes basic operations such as simple queries, but also includes advanced operations such as zone transfers; see section 8.2). Similarly, external operations such as exporting the contents of the zone to a master file (as discussed in section 8.3) MUST result in a single encoding form being used for that specific operation.

Note that the underlying zone database technology which may be employed by any particular server is beyond the scope of this document. Servers MAY use any database technology, charset or encoding deemed appropriate for the local environment, although the contents of the zone MUST be mapped to the canonical UCS character codes for all comparison operations (octet values excluded).
Since UTF-8 is used as the default encoding throughout this specification, it is RECOMMENDED for use as the default encoding with zone databases as well, but is not required.

Servers MUST NOT normalize or case-map any UCS characters which are decoded from UTF-8 or ACE encoded labels, and MUST restrict comparison operations of these labels to precise matches of the UCS domain names which are stored in the zone database. However, the seven bit character codes from any labels which are received as STD13 octet sequences MUST be compared in a case-neutral form, and MUST NOT be normalized as part of the comparison operation.

When a zone is converted to support internationalized domain names, all of the servers which replicate that zone MUST be upgraded. This is required due to ambiguities that can occur with labels which may be encoded as either STD13 octet sequences or ACE data, and where the label only uses character codes from the eight-bit range of character codes (this problem is described in detail in section 4.1.2). In order to ensure that all of the servers for a zone respond to one of those queries correctly, all of the servers which replicate the zone MUST fully support this document and its requirements.

8.2. Namespace Visibility Restrictions

In all cases, the encoding format of the domain names which are returned in response to a query MUST be the same as the encoding format which was used by the query. If the query was provided as a sequence of legacy labels, then all of the domain names which are provided in the response message MUST be provided as legacy labels (containing either ACE or STD13 octet encoded values). Similarly, if a query is provided as EDNS/UTF-8 encoded data, all domain names which are provided in the response message MUST be provided as UTF-8 encoded data in EDNS/UTF-8 extended labels.

In some situations, this process may require the server to perform an extra conversion. For example, assume that the <UCS>.example.com. domain name has two associated MX resource records, one of which points to the UCS domain name of mail.<UCS>.example.com, while the other points to the ACE encoded domain name of mail.<ACE>.example.net. (where the "<ACE>" label is the ACE equivalent of an internationalized sub-domain in the example.net. zone). If a UTF-8 query arrives for the MX resource records associated with the <UCS>.example.com. domain name, both resource records MUST be returned as EDNS/UTF-8 data. In order for this requirement to be satisfied, the server will have to decode the ACE encoded label to its UCS canonical form for zone storage purposes, and encode the domain name as UTF-8 for transmission whenever an EDNS/UTF-8 answer set is required.

The visibility rules specified in this section are mandatory for every domain name which is provided in any message. If a system requests a zone transfer and uses the EDNS/UTF-8 extended label type in the request, all of the domain names in all of the messages which are sent as part of the zone transfer MUST be provided in their UTF-8 encoded form. Similarly, if a zone transfer is requested and uses the legacy label type, then all of the domain names from all of the messages which are sent as part of the zone transfer MUST be provided as either STD13 octet sequences or ACE encoded data, using the legacy label type.

8.3. The Master File Format

STD13 specifies a "master file" format which is used as a platform-neutral storage and transfer format for importing and exporting the contents of a particular zone. Note that the master file is not the same as the operating database for a zone; the master file format is used (or is useful) for copying a zone to another server, storing a copy of the zone database off-line, emailing a copy of the zone to another user or system, and performing other off-line actions against the database's contents.
Once a zone is loaded on a server, however, any database technology can be used for managing the zones and generating response messages.

In order to facilitate the continued use of master files, any zone which is compliant with this specification MUST support the use of UTF-8 as an import and export encoding format for the master file associated with that zone. Furthermore, compliant versions of a master file are required to have the "$UTF-8" control literal at the beginning of the first line of text in the master file if it contains UTF-8 encoded data. Master files from zones which do not contain UTF-8 encoded domain names MUST NOT contain the "$UTF-8" control literal in the first print position of any line.

If the master file contains the "$UTF-8" control literal, all of the data within the master file MUST be encoded in UTF-8 as specified by RFC2279, and SHOULD be managed with UTF-8 compliant tools (such as UTF-8 text editors, mailers that support UTF-8 MIME encodings, and so forth).

9. Caching Guidelines

Whenever an internationalized domain name is stored in a cache, it MUST be stored in its canonical UCS character code form, regardless of whether the domain name was received as STD13 octet encoding sequences, UTF-8, or ACE data. Caches MUST NOT normalize or case convert any domain names that they store, as such a process could invalidate domain names that are not used for host identifiers.

Any subsequent queries which are processed through the cache MUST be compared against the stored UCS characters. Internationalized domain name labels which are decoded from UTF-8 or ACE labels MUST NOT be normalized or case-converted as part of the comparison operation, although labels which are provided as STD13 octet sequences MUST be compared as case-neutral octet values.
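
The comparison rule above can be illustrated with a minimal, non-normative sketch. It assumes that the cache holds each label as a sequence of UCS code points (a Python str here), and that "case-neutral" for STD13 octet sequences means folding only the seven-bit letters A-Z, per the STD13 case rules. The function names ascii_fold and labels_match are hypothetical.

```python
def ascii_fold(label: str) -> str:
    """Lower-case only the letters A-Z; every other code point is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


def labels_match(stored: str, queried: str, from_std13_octets: bool) -> bool:
    """Compare a queried label against the stored UCS form.

    from_std13_octets is True when the query label arrived as an STD13
    octet sequence, and False when it was decoded from UTF-8 or ACE.
    """
    if from_std13_octets:
        # Legacy labels: case-neutral comparison of the seven-bit letters only.
        return ascii_fold(stored) == ascii_fold(queried)
    # UTF-8 and ACE labels: exact match, no normalization or case conversion.
    return stored == queried
```

For instance, under these assumptions labels_match("Example", "EXAMPLE", True) holds, while the same two labels do not match when the query label was decoded from UTF-8.
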
Caches MUST be capable of providing UTF-8 and ACE encoded representations of the UCS domain names which are stored in the cache, with the appropriate format determined by the format used in the corresponding query. However, answer data MUST be restricted to only one encoding form for any protocol operation, meaning that queries containing legacy labels MUST only be answered with STD13 octet sequences and/or ACE encoded labels, while UTF-8 queries MUST only be answered with UTF-8 encoded domain names.

10. Security Considerations

This document defines an extension to the domain name system, and as such, it inherits the weaknesses which already exist in DNS.

Where possible, this specification strengthens DNS with multiple checks. For example, this specification requires that domain names be validated three times before they are used by applications: once on specification, once on entry at the authoritative zone or hosts database, and once again when the answer data is received by the requesting application. Despite these checks, the root weaknesses inherent in DNS are still present.

This document uses multiple encoding algorithms, although boundary conditions from the existing DNS are preserved for both the source and encoded representations.

11. IANA Considerations

This document requires the use of an EDNS extended label type identification code. This document uses the b000011 ELT code.

12. References

[AMC-ACE-Z]  "AMC-ACE-Z version 0.3.1"

[NAMEPREP]   "Preparation of Internationalized Host Names"

[RFC2119]    "Key words for use in RFCs to Indicate Requirement Levels"

[RFC952]     "DoD Internet host table specification"

[STD13]      (RFC 1034) "Domain names - concepts and facilities", (RFC 1035) "Domain names - implementation and specification"

[STD3]       (RFC 1122) "Requirements for Internet Hosts -- Communication Layers", (RFC 1123) "Requirements for Internet Hosts -- Application and Support"

[BCP18]      (RFC 2277) "IETF Policy on Character Sets and Languages"

[RFC2279]    "UTF-8, a transformation format of ISO 10646"

[RFC2671]    "Extension Mechanisms for DNS (EDNS0)"

[ASCII]      "ANSI X3.4-1968. USA Standard Code for Information Interchange"

[ISO10646]   "ISO/IEC 10646-1:2000. International Standard -- Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane"

13. Acknowledgements

This document is an assembly of multiple ideas and proposals which have been made on the IDN working group mailing list. Many of the ideas presented here have been proposed by multiple parties in one form or another, although Dan Oscarsson is credited for proposing a dual-mode operation which is capable of simultaneously supporting UTF-8 and legacy mode encodings.

Other contributors to key elements from this specification (some of them unknowingly or unwillingly) include (alphabetically) Marc Blanchett, Adam Costello, Mark Davis, Martin Duerst, Patrik Faltstrom, Paul Hoffman, David Hopwood, and many others.

14. Editor's Address

Eric A. Hall
ehall@ehsco.com