Internet Draft                                            Dan Oscarsson
draft-ietf-idn-sace-00.txt                                Telia ProSoft
Expires: 27 February 2001                                 27 August 2000

                Simple ASCII Compatible Encoding (SACE)

Status of this memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Abstract

   This document describes a way to encode non-ASCII characters in host
   names in a way that is completely compatible with the current ASCII
   only host names that are used in DNS. It can be used both with DNS to
   support software only handling ASCII host names and as a way to
   downgrade from 8-bit text to ASCII in protocols.


1. Introduction

   This document defines an ASCII Compatible Encoding (ACE) of names
   that can be used when communicating with DNS. It is needed during a
   transition period when non-ASCII names are introduced in DNS to avoid
   breaking programs expecting ASCII only.

   The Simple ASCII Compatible Encoding (SACE) defined here can be
   compared to [RACE]. The main differences are:
    - RACE encodes by first compressing and then encoding the resulting
      bit stream into ASCII. SACE encodes each character directly in one
      pass.
    - SACE recognises that a lot of Latin-based names are mostly
      composed of ASCII characters and gives a higher compression for
      those.  Within the 63 byte limit of DNS, RACE will allow 36
      characters for ISO 8859-1, and fewer if additional Latin
      characters are needed. SACE will allow around 40 characters if
      about 10 % of a Latin name is non-ASCII (in the UCS [ISO10646]
      range 0-0x217).  SACE is closer to the compression that UTF-8 has
      than RACE is.
    - Most ASCII characters will not be encoded so Latin based names
      composed of mostly ASCII characters will be somewhat readable.


1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Simple ASCII Compatible Encoding

   The encoding encodes values using the characters allowed in an ASCII
   host name (a-z, 0-9 and hyphen).

   Values are encoded as follows:

                    Character - value mapping

      value  character               value  character
         0       a                     18       s
         1       b                     19       t
         2       c                     20       u
         3       d                     21       v
         4       e                     22       w
         5       f                     23       x
         6       g                     24       y
         7       h                     25       z
         8       i                     26       1
         9       j                     27       2
        10       k                     28       3
        11       l                     29       4
        12       m                     30       7
        13       n                     31       9
        14       o                     32       0
        15       p                     33       8
        16       q                     34       5
        17       r                     35       6
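
   As an illustration only, the mapping table above can be held in a
   single 36-character lookup string. The names ALPHABET,
   value_to_char and char_to_value in this Python sketch are
   hypothetical and not part of the encoding.

```python
# Characters in value order 0..35, copied from the mapping table above.
ALPHABET = "abcdefghijklmnopqrstuvwxyz1234790856"


def value_to_char(value: int) -> str:
    """Map a value in the range 0-35 to its host name character."""
    if not 0 <= value < len(ALPHABET):
        raise ValueError("value out of range 0-35")
    return ALPHABET[value]


def char_to_value(char: str) -> int:
    """Map a host name character back to its value in the range 0-35."""
    value = ALPHABET.find(char)
    if len(char) != 1 or value < 0:
        raise ValueError("not a SACE value character")
    return value
```

   Note that the digits 0, 8, 5 and 6 map to the values 32-35, so they
   never appear as one of the 5 bit values (0-31).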



   In the following description the following syntax will be used:
      B => one value in the range 0-35 mapped to a character as above
      X => one value in the range 0-31 mapped to a character as above

   Each UCS character is identified as follows:
      latin  => a character in the range 0-0x217
      10bit  => a character in the range 0x218-0x2FFF
      base36 => all other characters
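
   The three classes can be told apart with two comparisons. The
   following Python sketch shows one way to do it; the function name
   classify and the returned strings are chosen here for illustration.

```python
def classify(code_point: int) -> str:
    """Return the SACE class of a UCS code point.

    The boundaries are the ones listed above: 0-0x217 is "latin",
    0x218-0x2FFF is "10bit", and everything else is "base36".
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x217:
        return "latin"
    if code_point <= 0x2FFF:
        return "10bit"
    return "base36"
```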

   While a string is being encoded or decoded, a current mode is kept.
   In each mode characters are encoded like this:
      latin  => as themselves, 00 for 0, 88 for 8 or as 10 bit value
                encoded as 0XX (two 5 bit values)
      10bit  => as 15 bits represented by its current prefix of 5 bits
                followed by 10 bits encoded as XX
                (the value is the 15 bits of prefix and
                10 bits concatenated)
      base36 => as a base 36 value represented by its current base 36
                prefix followed by three base 36 digits encoded as BBB
                (the value is prefix*36*36*36+B*36*36+B*36+B)
                Before encoding, the character value must first be
                reduced:
                  if >= 0xd800 reduce by 8192 (private/surrogate start)
                  then reduce by 0x2FFF.
                After decoding, the character value needs to be
                restored:
                  add 0x2FFF
                  followed by adding 8192 if >= 0xd800
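
   The per-mode rules above can be sketched in Python as follows. This
   is a reading of the rules, not a reference implementation, and it
   rests on assumptions that the text does not state: the characters
   that stand for themselves in latin mode are taken to be a-z, the
   digits other than 0 and 8, and the hyphen; every other latin
   character (including upper case letters) is written as 0XX; and the
   base 36 prefix is weighted by 36*36*36, the positional reading of
   "prefix followed by three base 36 digits". All function names are
   hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz1234790856"  # values 0..35
# Assumption: these latin characters are emitted unchanged.
LITERALS = set("abcdefghijklmnopqrstuvwxyz12345679-")


def encode_latin(cp: int) -> str:
    """Encode one code point in 0-0x217 using the latin mode rules."""
    if not 0 <= cp <= 0x217:
        raise ValueError("not a latin character")
    ch = chr(cp)
    if ch == "0":
        return "00"
    if ch == "8":
        return "88"
    if ch in LITERALS:
        return ch
    # 10 bit value written as 0XX: two 5 bit values, high half first.
    return "0" + ALPHABET[cp >> 5] + ALPHABET[cp & 0x1F]


def encode_10bit(cp: int):
    """Return (5 bit prefix, "XX") for a code point in 0x218-0x2FFF."""
    if not 0x218 <= cp <= 0x2FFF:
        raise ValueError("not a 10bit character")
    low = cp & 0x3FF
    return cp >> 10, ALPHABET[low >> 5] + ALPHABET[low & 0x1F]


def encode_base36(cp: int):
    """Return (prefix, "BBB") for a code point above 0x2FFF."""
    if cp <= 0x2FFF:
        raise ValueError("not a base36 character")
    # Reduce as described above. Under this reading, code points in
    # 0xD800-0xF7FF do not survive a round trip.
    value = cp - 8192 if cp >= 0xD800 else cp
    value -= 0x2FFF
    prefix, rest = divmod(value, 36 ** 3)
    d2, rest = divmod(rest, 36 * 36)
    d1, d0 = divmod(rest, 36)
    return prefix, ALPHABET[d2] + ALPHABET[d1] + ALPHABET[d0]


def decode_base36(prefix: int, digits: str) -> int:
    """Invert encode_base36: rebuild the value, then restore it."""
    d2, d1, d0 = (ALPHABET.index(c) for c in digits)
    value = prefix * 36 ** 3 + d2 * 36 * 36 + d1 * 36 + d0
    value += 0x2FFF
    if value >= 0xD800:
        value += 8192
    return value
```

   For example, under these assumptions U+00E5 is written as "0hf" in
   latin mode, U+0416 as "aw" with 10bit prefix 1, and U+4E2D as "f5w"
   with base36 prefix 0.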


2.1 Decoding a string

   During decode you start with:
      Mode: latin
      10bit prefix: 0
      base36 prefix: 0

   Then the characters in an encoded string are interpreted as follows
   depending on current mode:

    When in latin mode:
      00  => the character 0
      0XX => XX represents 10 bits which decodes to one character
      88  => the character 8
      85  => switch to 10bit mode with same prefix as last time
      8X5 => switch to 10bit mode setting X as current 10bit prefix
      87  => switch to base36 mode with same prefix as last time
      8X7 => switch to base36 mode setting X as current base36 prefix
      other  => the character represents itself
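
   As a partial illustration, the latin mode rules above can be
   sketched in Python for strings that never leave latin mode. The
   function name decode_latin_only is hypothetical, and the mode
   switch sequences (85, 8X5, 87, 8X7) are deliberately not handled;
   the sketch raises an error when it meets one.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz1234790856"  # values 0..35


def decode_latin_only(encoded: str) -> str:
    """Decode a SACE string that stays in latin mode throughout."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "0":
            if encoded.startswith("00", i):
                out.append("0")          # 00 => the character 0
                i += 2
                continue
            pair = encoded[i + 1:i + 3]  # 0XX => one 10 bit character
            if len(pair) != 2:
                raise ValueError("truncated 0XX sequence")
            high, low = (ALPHABET.index(c) for c in pair)
            if high > 31 or low > 31:
                raise ValueError("0XX needs two 5 bit values")
            out.append(chr((high << 5) | low))
            i += 3
        elif ch == "8":
            if not encoded.startswith("88", i):
                raise NotImplementedError("mode switch not handled")
            out.append("8")              # 88 => the character 8
            i += 2
        else:
            out.append(ch)               # other => represents itself
            i += 1
    return "".join(out)
```

   With this sketch, "bl0hfb0her" decodes to the six characters b, l,
   U+00E5, b, U+00E4, r.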

    When in 10bit mode
      - => the character -
      0 => switch to latin mode
      X5 => switch to 10bit mode using X as current prefix
      7  => switch to base36 mode with same prefix as last time
      X7 => switch to base36 mode using X as current prefix
      XX => current 10bit prefix plus XX gives the character

    When in base36 mode
      -- => the character -
      -0 => switch to latin mode
      -5 => switch to 10bit mode with same prefix as last time
      -X5 => switch to 10bit mode setting X as current prefix
      -X7 => switch to base36 mode setting X as current prefix
      BBB => current base36 prefix plus BBB as base 36 values gives
             the character


2.2 Encoding a string

   To encode a string you start with the data as UCS characters and:
      Mode: latin
      10bit prefix: 0
      base36 prefix: 0

   Then for each UCS character, the mode and/or prefix is switched if
   needed and then the character is encoded as defined above.
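
   The bookkeeping described above, deciding for each character
   whether the mode or the prefix has to change, can be sketched in
   Python as follows. The sketch only plans the switches and emits no
   encoded output. It makes simplifying assumptions: it always moves
   to the mode matching the class of the character (so a hyphen always
   goes to latin mode), and it weights the base36 prefix by 36*36*36.
   The names classify, needed_prefix and plan_switches are
   hypothetical.

```python
def classify(cp: int) -> str:
    """Class of a code point: latin, 10bit or base36."""
    if cp <= 0x217:
        return "latin"
    if cp <= 0x2FFF:
        return "10bit"
    return "base36"


def needed_prefix(cp: int, mode: str):
    """Prefix a code point requires in its mode (None for latin)."""
    if mode == "10bit":
        return cp >> 10
    if mode == "base36":
        value = cp - 8192 if cp >= 0xD800 else cp
        return (value - 0x2FFF) // 36 ** 3
    return None


def plan_switches(text: str):
    """List (char, mode, prefix, mode_changed, prefix_changed)."""
    mode = "latin"                         # initial state
    prefixes = {"10bit": 0, "base36": 0}   # one remembered prefix each
    steps = []
    for ch in text:
        cp = ord(ch)
        want = classify(cp)
        prefix = needed_prefix(cp, want)
        mode_changed = want != mode
        prefix_changed = want != "latin" and prefix != prefixes[want]
        steps.append((ch, want, prefix, mode_changed, prefix_changed))
        mode = want
        if want != "latin":
            prefixes[want] = prefix
    return steps
```

   Because each mode remembers its own prefix, returning to a mode
   whose prefix is unchanged needs only the short switch sequence.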


3. References

   [RFC2119]  Scott Bradner, "Key words for use in RFCs to Indicate
              Requirement Levels", March 1997, RFC 2119.

   [RFC2279]  F. Yergeau, "UTF-8, a transformation format of ISO 10646",
              RFC 2279, January 1998.

   [ISO10646] ISO/IEC 10646-1:2000. International Standard --
              Information technology -- Universal Multiple-Octet Coded
              Character Set (UCS)

   [Unicode]  The Unicode Consortium, "The Unicode Standard -- Version
              3.0", ISBN 0-201-61633-5. Described at
              http://www.unicode.org/unicode/standard/versions/
              Unicode3.0.html


   [IDNREQ]   James Seng, "Requirements of Internationalized Domain
              Names", draft-ietf-idn-requirement.

   [RACE]     Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
              for IDN", draft-ietf-idn-race.

4. Acknowledgements

   Paul Hoffman for many good ideas.


Author's Address

   Dan Oscarsson
   Telia ProSoft AB
   Box 85
   201 20 Malmo
   Sweden

   E-mail: Dan.Oscarsson@trab.se

