idnits 2.17.1

draft-ietf-idn-sace-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 73 has weird spacing: '... value chara...'

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (27 August 2000) is 8642 days in the past. Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC2279' is defined on line 174, but no explicit
     reference was found in the text

  == Unused Reference: 'Unicode' is defined on line 181, but no explicit
     reference was found in the text

  == Unused Reference: 'IDNREQ' is defined on line 186, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 2279 (Obsoleted by RFC 3629)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNREQ'

  -- Possible downref: Normative reference to a draft: ref. 'RACE'


     Summary: 5 errors (**), 0 flaws (~~), 6 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

  1  Internet Draft                                             Dan Oscarsson
  2  draft-ietf-idn-sace-00.txt                                 Telia ProSoft
  3  Expires: 27 February 2001                                 27 August 2000

  5                 Simple ASCII Compatible Encoding (SACE)

  7  Status of this memo

  9     This document is an Internet-Draft and is in full conformance with
 10     all provisions of Section 10 of RFC2026.

 12     Internet-Drafts are working documents of the Internet Engineering
 13     Task Force (IETF), its areas, and its working groups. Note that other
 14     groups may also distribute working documents as Internet-Drafts.

 16     Internet-Drafts are draft documents valid for a maximum of six months
 17     and may be updated, replaced, or obsoleted by other documents at any
 18     time. It is inappropriate to use Internet-Drafts as reference
 19     material or to cite them other than as "work in progress."

 21     The list of current Internet-Drafts can be accessed at
 22     http://www.ietf.org/ietf/1id-abstracts.txt

 24     The list of Internet-Draft Shadow Directories can be accessed at
 25     http://www.ietf.org/shadow.html.

 27  Abstract

 29     This document describes a way to encode non-ASCII characters in host
 30     names in a way that is completely compatible with the current ASCII
 31     only host names that are used in DNS. It can be used both with DNS to
 32     support software only handling ASCII host names and as a way to
 33     downgrade from 8-bit text to ASCII in protocols.

 35  1. Introduction

 37     This document defines an ASCII Compatible Encoding (ACE) of names
 38     that can be used when communicating with DNS. It is needed during a
 39     transition period when non-ASCII names are introduced in DNS to avoid
 40     breaking programs expecting ASCII only.

 42     The Simple ASCII Compatible Encoding (SACE) defined here can be
 43     compared to [RACE]. The main differences are:
 44     - RACE encodes by first compressing and then encoding the resulting
 45       bit stream into ASCII. SACE encodes each character directly in one
 46       pass.
 47     - SACE recognises that a lot of Latin-based names are mostly
 48       composed of ASCII characters and gives a higher compression for
 49       those. In the 63 byte limit of DNS, RACE will allow 36 characters
 50       for ISO 8859-1 and fewer if characters from the additional Latin
 51       characters are needed. SACE will allow around 40 characters if
 52       about 10 % of a Latin name is non-ASCII (in the UCS [ISO10646]
 53       range 0-0x217). SACE is closer to the compression that UTF-8 has
 54       than RACE.
 55     - Most ASCII characters will not be encoded, so Latin-based names
 56       composed of mostly ASCII characters will be somewhat readable.

 58  1.1 Terminology

 60     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
 61     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 62     document are to be interpreted as described in [RFC2119].

 64  2. Simple ASCII Compatible Encoding

 66     The encoding encodes values using the available characters allowed in
 67     an ASCII host name (a-z0-9 and hyphen).

 69     Values are encoded as follows:

 71              Character - value mapping

 73        value  character      value  character
 74          0        a            18       s
 75          1        b            19       t
 76          2        c            20       u
 77          3        d            21       v
 78          4        e            22       w
 79          5        f            23       x
 80          6        g            24       y
 81          7        h            25       z
 82          8        i            26       1
 83          9        j            27       2
 84         10        k            28       3
 85         11        l            29       4
 86         12        m            30       7
 87         13        n            31       9
 88         14        o            32       0
 89         15        p            33       8
 90         16        q            34       5
 91         17        r            35       6

 93     In the following description the following syntax will be used:
 94       B => one value in the range 0-35 mapped to a character as above
 95       X => one value in the range 0-31 mapped to a character as above

 97     Each UCS character is identified as follows:
 98       latin  => a character in the range 0-0x217
 99       10bit  => a character in the range 0x218-0x2FFF
100       base36 => all other characters

102     While encoding or decoding a string, a current mode is used. In each
103     mode characters are encoded like this:
104       latin  => as themselves, 00 for 0, 88 for 8 or as 10 bit value
105                 encoded as 0XX (two 5 bit values)
106       10bit  => as 15 bits represented by its current prefix of 5 bits
107                 followed by 10 bits encoded as XX
108                 (the value is the 15 bits of prefix and
109                 10 bits concatenated)
110       base36 => as a base 36 value represented by its current base 36
111                 prefix followed by three base 36 digits encoded as BBB
112                 (the value is prefix*36*36*36*36+B*36*36+B*36+B)
113                 Before encoding the character value must first be
114                 reduced:
115                   if >= 0xd800 reduce by 8192 (private/surrogate start)
116                   then reduce by 0x2FFF.
117                 After decoding, the character value needs to be restored
118                 as follows:
119                   add 0x2FFF
120                   followed by adding 8192 if >= 0xd800

122  2.1 Decoding a string

124     During decoding you start with:
125       Mode: latin
126       10bit prefix: 0
127       base36 prefix: 0

129     Then the characters in an encoded string are interpreted as follows
130     depending on the current mode:

132     When in latin mode:
133       00    => the character 0
134       0XX   => XX represents 10 bits which decodes to one character
135       88    => the character 8
136       85    => switch to 10bit mode with same prefix as last time
137       8X5   => switch to 10bit mode setting X as current 10bit prefix
138       87    => switch to base36 mode with same prefix as last time
139       8X7   => switch to base36 mode setting X as current base36 prefix
140       other => the character represents itself

142     When in 10bit mode:
143       -     => the character -
144       0     => switch to latin mode
145       X5    => switch to 10bit mode using X as current prefix
146       7     => switch to base36 mode with same prefix as last time
147       X7    => switch to base36 mode using X as current prefix
148       XX    => current 10bit prefix plus XX gives the character

150     When in base36 mode:
151       --    => the character -
152       -0    => switch to latin mode
153       -5    => switch to 10bit mode with same prefix as last time
154       -X5   => switch to 10bit mode setting X as current prefix
155       -X7   => switch to base36 mode setting X as current prefix
156       XXX   => current base36 prefix plus XXX as base 36 values gives
157                character

159  2.2 Encoding a string

161     To encode a string you start with the data as UCS characters and:
162       Mode: latin
163       10bit prefix: 0
164       base36 prefix: 0

166     Then for each UCS character, the mode and/or prefix is switched if
167     needed and then the character is encoded as defined above.

169  3. References

171     [RFC2119]  Scott Bradner, "Key words for use in RFCs to Indicate
172                Requirement Levels", March 1997, RFC 2119.

174     [RFC2279]  F. Yergeau, "UTF-8, a transformation format of ISO 10646",
175                RFC 2279, January 1998.

177     [ISO10646] ISO/IEC 10646-1:2000. International Standard --
178                Information technology -- Universal Multiple-Octet Coded
179                Character Set (UCS)

181     [Unicode]  The Unicode Consortium, "The Unicode Standard -- Version
182                3.0", ISBN 0-201-61633-5. Described at
183                http://www.unicode.org/unicode/standard/versions/
184                Unicode3.0.html

186     [IDNREQ]   James Seng, "Requirements of Internationalized Domain
187                Names", draft-ietf-idn-requirement.

189     [RACE]     Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
190                for IDN", draft-ietf-idn-race.

192  4. Acknowledgements

194     Paul Hoffman for many good ideas.

196  Author's Address

198     Dan Oscarsson
199     Telia ProSoft AB
200     Box 85
201     201 20 Malmo
202     Sweden

204     E-mail: Dan.Oscarsson@trab.se
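
Illustrative sketch (not part of the draft)

   The character - value mapping of section 2 and the latin mode rules
   can be illustrated with a minimal Python sketch. The names ALPHABET,
   LITERALS and encode_latin are hypothetical. Two points the text
   leaves open are assumed here: the first X of 0XX carries the upper
   5 bits, and only lower-case a-z, the digits and hyphen are emitted as
   themselves. Mode switching (10bit and base36) is not covered.

```python
# Value -> character table from section 2 (string index = value).
ALPHABET = "abcdefghijklmnopqrstuvwxyz1234790856"

# Assumption: characters that stand for themselves in latin mode.
# '0' and '8' are escape characters and are handled separately.
LITERALS = set("abcdefghijklmnopqrstuvwxyz-12345679")


def encode_latin(cp: int) -> str:
    """Encode one UCS code point (range 0-0x217) in latin mode."""
    if not 0 <= cp <= 0x217:
        raise ValueError("code point is outside the latin range")
    ch = chr(cp)
    if ch in ("0", "8"):
        return ch * 2              # 00 for 0, 88 for 8
    if ch in LITERALS:
        return ch                  # encoded as itself
    # 0XX: a 10 bit value as two 5 bit values.
    # Assumption: the upper 5 bits come first.
    return "0" + ALPHABET[cp >> 5] + ALPHABET[cp & 0x1F]


def encode_latin_string(text: str) -> str:
    """Encode a string made up only of latin-range characters."""
    return "".join(encode_latin(ord(c)) for c in text)
```

   Under these assumptions, encode_latin_string("bl\u00e5b\u00e4r")
   gives "bl0hfb0her": the two non-ASCII letters take three characters
   each, and the ASCII letters stay readable.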