Internet Draft                                            Dan Oscarsson
draft-ietf-idn-sace-00.txt                                Telia ProSoft
Expires: 27 February 2001                                27 August 2000

               Simple ASCII Compatible Encoding (SACE)

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document describes a way to encode non-ASCII characters in host
   names that is completely compatible with the current ASCII-only host
   names used in DNS. It can be used both with DNS, to support software
   that handles only ASCII host names, and as a way to downgrade from
   8-bit text to ASCII in protocols.

1. Introduction

   This document defines an ASCII Compatible Encoding (ACE) of names
   that can be used when communicating with DNS. It is needed during a
   transition period when non-ASCII names are introduced in DNS, to
   avoid breaking programs that expect ASCII only.

   The Simple ASCII Compatible Encoding (SACE) defined here can be
   compared to [RACE]. The main differences are:

   - RACE encodes by first compressing and then encoding the resulting
     bit stream into ASCII. SACE encodes each character directly in one
     pass.

   - SACE recognises that a lot of Latin-based names are mostly composed
     of ASCII characters and gives higher compression for those.
     Within the 63-byte limit of DNS, RACE will allow 36 characters for
     ISO 8859-1, and fewer if additional Latin characters are needed.
     SACE will allow around 40 characters if about 10 % of a Latin name
     is non-ASCII (in the UCS [ISO10646] range 0-0x217). SACE is closer
     than RACE to the compression that UTF-8 has.

   - Most ASCII characters will not be encoded, so Latin-based names
     composed mostly of ASCII characters will be somewhat readable.

1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Simple ASCII Compatible Encoding

   The encoding encodes values using the characters allowed in an ASCII
   host name (a-z, 0-9 and hyphen).

   Values are encoded as follows:

   Character - value mapping

      value  character      value  character
        0        a            18       s
        1        b            19       t
        2        c            20       u
        3        d            21       v
        4        e            22       w
        5        f            23       x
        6        g            24       y
        7        h            25       z
        8        i            26       1
        9        j            27       2
       10        k            28       3
       11        l            29       4
       12        m            30       7
       13        n            31       9
       14        o            32       0
       15        p            33       8
       16        q            34       5
       17        r            35       6

   In the following description the following syntax will be used:

      B => one value in the range 0-35 mapped to a character as above
      X => one value in the range 0-31 mapped to a character as above

   Each UCS character is identified as follows:

      latin  => a character in the range 0-0x217
      10bit  => a character in the range 0x218-0x2FFF
      base36 => all other characters

   During encoding/decoding of a string, a current mode is used.
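   The value mapping and the three character classes above can be
   sketched in Python as follows. This is only an illustrative sketch,
   not part of the specification: the names VALUE_TO_CHAR,
   CHAR_TO_VALUE, X_CHARS and classify are hypothetical and chosen here
   for the example.

```python
# Illustrative sketch only; all names below are hypothetical.

# Values 0-35 in table order: a-z for 0-25, then 1 2 3 4 7 9 0 8 5 6.
VALUE_TO_CHAR = "abcdefghijklmnopqrstuvwxyz1234790856"

# Reverse lookup from an encoded character to its value.
CHAR_TO_VALUE = {char: value for value, char in enumerate(VALUE_TO_CHAR)}

# "X" in the syntax means a value in the range 0-31, "B" a value in 0-35.
X_CHARS = VALUE_TO_CHAR[:32]

LATIN_MAX = 0x217    # upper bound of the "latin" class
TENBIT_MAX = 0x2FFF  # upper bound of the "10bit" class


def classify(code_point: int) -> str:
    """Return the class of a UCS code point: latin, 10bit or base36."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= LATIN_MAX:
        return "latin"
    if code_point <= TENBIT_MAX:
        return "10bit"
    return "base36"
```

   With this sketch, classify(0xE5) (LATIN SMALL LETTER A WITH RING
   ABOVE) falls in the latin class, and the digits 0, 8, 5 and 6 are
   the only characters outside the X range, which is what lets them
   act as escape and mode-switch markers in the rules that follow.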
   In each mode characters are encoded like this:

      latin  => as themselves, 00 for 0, 88 for 8, or as a 10 bit value
                encoded as 0XX (two 5 bit values)
      10bit  => as 15 bits represented by its current prefix of 5 bits
                followed by 10 bits encoded as XX
                (the value is the 15 bits formed by concatenating the
                prefix and the 10 bits)
      base36 => as a base 36 value represented by its current base 36
                prefix followed by three base 36 digits encoded as BBB
                (the value is prefix*36*36*36*36+B*36*36+B*36+B)
                Before encoding, the character value must first be
                reduced: if >= 0xd800 reduce by 8192 (private/surrogate
                start), then reduce by 0x2FFF.
                After decoding, the character value needs to be
                restored: add 0x2FFF, then add 8192 if >= 0xd800.

2.1 Decoding a string

   During decoding you start with:

      Mode:          latin
      10bit prefix:  0
      base36 prefix: 0

   Then the characters in an encoded string are interpreted as follows,
   depending on the current mode:

   When in latin mode:

      00    => the character 0
      0XX   => XX represents 10 bits which decode to one character
      88    => the character 8
      85    => switch to 10bit mode with same prefix as last time
      8X5   => switch to 10bit mode setting X as current 10bit prefix
      87    => switch to base36 mode with same prefix as last time
      8X7   => switch to base36 mode setting X as current base36 prefix
      other => the character represents itself

   When in 10bit mode:

      -     => the character -
      0     => switch to latin mode
      X5    => switch to 10bit mode using X as current prefix
      7     => switch to base36 mode with same prefix as last time
      X7    => switch to base36 mode using X as current prefix
      XX    => current 10bit prefix plus XX gives the character

   When in base36 mode:

      --    => the character -
      -0    => switch to latin mode
      -5    => switch to 10bit mode with same prefix as last time
      -X5   => switch to 10bit mode setting X as current prefix
      -X7   => switch to base36 mode setting X as current prefix
      XXX   => current base36 prefix plus XXX as base 36 values gives
               the character

2.2 Encoding a string

   To encode a string you start with the data as UCS characters and:

      Mode:          latin
      10bit prefix:  0
      base36 prefix: 0

   Then, for each UCS character, the mode and/or prefix is switched if
   needed and the character is encoded as defined above.

3. References

   [RFC2119]   Scott Bradner, "Key words for use in RFCs to Indicate
               Requirement Levels", RFC 2119, March 1997.

   [RFC2279]   F. Yergeau, "UTF-8, a transformation format of ISO
               10646", RFC 2279, January 1998.

   [ISO10646]  ISO/IEC 10646-1:2000. International Standard --
               Information technology -- Universal Multiple-Octet Coded
               Character Set (UCS)

   [Unicode]   The Unicode Consortium, "The Unicode Standard -- Version
               3.0", ISBN 0-201-61633-5. Described at
               http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

   [IDNREQ]    James Seng, "Requirements of Internationalized Domain
               Names", draft-ietf-idn-requirement.

   [RACE]      Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
               for IDN", draft-ietf-idn-race.

4. Acknowledgements

   Paul Hoffman for many good ideas.

Author's Address

   Dan Oscarsson
   Telia ProSoft AB
   Box 85
   201 20 Malmo
   Sweden

   E-mail: Dan.Oscarsson@trab.se