Network Working Group                                         J. Klensin
Internet-Draft                                              M. Padlipsky
Expires: October 31, 2006                                 April 29, 2006


                 Unicode Format for Network Interchange
                     draft-klensin-net-utf8-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 31, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   The Internet today is in need of a standardized form for the
   transmission of internationalized "text" information, paralleling the
   specifications for the use of ASCII that date from the early days of
   the ARPANET.  This document specifies that format, using UTF-8 with
   a specified normalization form and specific line-ending sequences.


Klensin & Padlipsky     Expires October 31, 2006                [Page 1]

Internet-Draft                Network UTF-8                   April 2006


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Background . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Net-Unicode  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Normalization  . . . . . . . . . . . . . . . . . . . . . . . .  5
   4.  Versions of Unicode  . . . . . . . . . . . . . . . . . . . . .  5
   5.  Context of this Proposal . . . . . . . . . . . . . . . . . . .  6
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . .  7
   7.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . .  7
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . .  7
     8.1.  Normative References . . . . . . . . . . . . . . . . . . .  7
     8.2.  Informative References . . . . . . . . . . . . . . . . . .  8
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 10
   Intellectual Property and Copyright Statements . . . . . . . . . . 11


1.  Introduction

1.1.  Background

   This subsection contains a review of prior work in the ARPANET and
   Internet to establish a standard text type, work that establishes the
   context and motivation for the approach taken in this document.
   Those who are uninterested in that review and analysis can safely
   skip to the next section.

   One of the earlier application design decisions made in the
   development of ARPANET, one that was carried forward into the
   Internet, was to standardize on a single and very specific coding
   for "text" to be passed across the network [RFC0020].
   Hosts on the network were then responsible for translating or mapping
   from whatever character coding conventions were used locally to that
   common intermediate representation, with sending hosts mapping to it
   and receiving ones mapping from it to their local forms as needed.
   It is interesting to note that at the time the ARPANET was being
   developed, participating Host operating systems used at least three
   different character coding standards: the antiquated BCD (Binary
   Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
   (Extended BCD Interchange Code), and the then-still emerging ASCII
   (American Standard Code for Information Interchange).  Since the
   ARPANET was an "open" project and EBCDIC was intimately linked to a
   particular hardware vendor, the original Network Working Group agreed
   that its standard should be ASCII.  That ASCII form was precisely
   "7-bit ASCII in an 8-bit field", which was in effect a compromise
   between Hosts that were natively 7-bit oriented (e.g., with five
   seven-bit characters in a 36-bit word), those that were 8-bit
   oriented (using eight-bit characters) and those that placed the
   seven-bit ASCII characters in 9-bit fields with two leading zero
   bits (four characters in a 36-bit word).

   More standardization was suggested in the first preliminary
   description of the Telnet protocol [RFC0097].  With the iterations of
   that protocol [RFC0137] [RFC0139] and the drawing together of an
   essentially formal definition somewhat later [RFC0318], a standard
   abstraction, the Network Virtual Terminal (NVT) was established.  NVT
   character-coding conventions (initially called "Telnet ASCII" and
   later called "NVT ASCII", or, more casually, "network ASCII")
   included the requirement that Carriage Return - Line Feed (CRLF) be
   the common representation for ending lines of text (given that some
   participating "Host" operating systems used the one natively, some
   the other, and at least one used both) and specified conventions for
   some other characters.  Also, since NVT ASCII was restricted to
   seven-bit characters, use of the high-order bit in octets was
   reserved for the transmission of control signaling information.


   At a very high level, the concept was that a system could use
   whatever character coding and line representations were appropriate
   locally, but text transmitted over the network as text must conform
   to the single "network virtual terminal" convention.

   In our more internationalized world, "text" clearly no longer equates
   unambiguously to "network ASCII".  Fortunately, however, we are
   converging on Unicode [Unicode], [ISO10646] as a single international
   interchange character coding and no longer need to deal with per-
   script standards for character sets (e.g., one standard for each of
   Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
   languages that are usually considered to share a script, such as
   French, German, or Swedish).  Unfortunately, though, while it is
   certainly time to define a Unicode-based text type for use as a
   common text interchange format, "use Unicode" involves even more
   ambiguity than "use ASCII" did decades ago.  Unicode can be
   transmitted in UCS-2 (all characters 16 bit), UTF-16 (one or two
   16-bit code units per character), UCS-4 or UTF-32 (all characters
   32 bit), UTF-8 (a variable-length encoding with some additional
   properties) [RFC3629], and other forms.  Also, as with ASCII, any of
   these forms may have different line-ending conventions.
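
   As a rough illustration of that ambiguity, the following Python
   sketch encodes the same two characters in several Unicode encoding
   forms and prints the resulting octets.  It is not part of this
   specification; the sample string and the choice of big-endian
   variants are arbitrary assumptions made for the example.

```python
# Illustrative only: the same text in several Unicode encoding forms.
sample = "A\u00e9"  # LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE

encodings = {
    "UTF-8": sample.encode("utf-8"),
    "UTF-16 (big-endian)": sample.encode("utf-16-be"),
    "UTF-32 (big-endian)": sample.encode("utf-32-be"),
}

# Each form yields a different octet sequence (and length) for the same text.
for name, octets in encodings.items():
    print(f"{name:20} {len(octets):2} octets  {octets.hex()}")
```

   A receiver that is told only "this is Unicode" cannot tell which of
   these octet sequences to expect, which is why a single form is
   chosen below.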

   This document proposes to establish "Net-Unicode" as a new
   standardized text transmission form for the Internet, to serve as an
   internationalized alternative for NVT ASCII when specified in new --
   and, where appropriate, updated -- protocols.  UTF-8 [RFC3629] is
   chosen for the coding because it has good compatibility properties
   with ASCII and for other reasons discussed in the existing IETF
   character set policy [RFC2277].

   In circumstances in which there is a choice, use of Unicode and the
   text encoding specified here is preferred to the double-byte encoding
   of "extended ASCII" [RFC0698] and SHOULD be used.

1.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].


2.  Net-Unicode

   The Network Unicode (Net-Unicode or Net-UTF-8) format is defined as
   follows:

   1.  Characters will be coded in UTF-8 as defined in [RFC3629].
   2.  Line-endings will be indicated by the sequence Carriage-Return
       (U+000D) followed by Line-Feed (U+000A).
   3.  Before transmission, all character sequences will be normalized
       according to Unicode method "NFC" (see Section 3).
   4.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
       ("BOM") signature MUST NOT appear at the beginning of these text
       strings.
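
   The four rules above might be applied by a sending host roughly as
   in the following Python sketch.  The function name to_net_unicode
   is hypothetical, and the sketch assumes that every CRLF, bare LF,
   or bare CR in the local text is meant as a line ending; it makes no
   attempt to handle the optional CR NUL overstriking convention
   discussed below.

```python
import unicodedata


def to_net_unicode(text: str) -> bytes:
    """Convert locally held text to octets in the form described above.

    Hypothetical helper for illustration; not defined by this document.
    """
    # Rule 4: drop a leading U+FEFF so that no BOM signature is sent.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Rule 2: map local line endings (CRLF, bare LF, bare CR) to CRLF.
    # Assumption: a bare CR in local text is intended as a line ending.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    text = "\r\n".join(lines)
    # Rule 3: normalize to NFC before transmission.
    text = unicodedata.normalize("NFC", text)
    # Rule 1: encode as UTF-8; Python's "utf-8" codec does not add a BOM.
    return text.encode("utf-8")


print(to_net_unicode("cafe\u0301\nmenu"))
```

   A receiving host would perform the reverse mapping from CRLF to
   its local line-ending convention after decoding.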

   The NVT specification contained a number of additional provisions,
   e.g., for the optional use of backspacing and "bare CR" (sent as CR
   NUL) to generate overstruck character sequences.  The much greater
   number of precomposed characters in Unicode, the availability of
   combining characters, and the growing use of markup conventions of
   various types (rather than character-coding) to show, e.g., emphasis,
   should make such sequences largely unnecessary.  Because they were
   optional in NVT applications, they SHOULD be avoided if at all
   possible; if they are used, this specification does not change the
   NVT rules and conventions of RFC 318 [RFC0318] and RFC 854
   [RFC0854], including the rule that a CR not followed by LF be sent
   as CR NUL (note that NUL, X'00', is hostile to programming languages
   that use that character as a string delimiter).


3.  Normalization

   There are a number of characters in Unicode that can be represented
   in two ways, either as a single precomposed character or as the
   combination of a base character and one or more combining characters.
   If any combination of ways to represent characters is permitted in
   the data stream, comparison and use of those strings becomes
   difficult or impossible.  The Unicode Consortium specifies a
   normalization method, known as NFC [NFC], which provides the
   necessary mappings to convert sequences of combining characters into
   a precomposed character when that is feasible.
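
   The comparison problem can be seen in a small Python sketch that
   uses the standard unicodedata module.  The two sample strings are
   chosen only for illustration: both display as "e" with an acute
   accent, but they compare unequal until both are normalized.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw code point sequences differ, so a naive comparison fails.
print(precomposed == combining)

# After NFC normalization both are the single precomposed character.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", combining)
print(nfc_a == nfc_b, len(nfc_b))
```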


4.  Versions of Unicode

   In retrospect, one of the advantages of ASCII [X3.4-1978] when it was
   chosen was that the code space was full when the Standard was first
   published.  There was no practical way to add characters or change
   code point assignments without being obviously incompatible.  Unicode
   does not have that property: there are large blocks of space reserved
   for future expansion and new versions, with new characters and code
   point assignments, appear at regular intervals.

   While there are some security issues if people deliberately try to
   trick the system (see Section 6), Unicode version changes should not
   have a significant impact on the text stream specification of this
   document for the following reasons:

   o  The transformation between Unicode code table positions and the
      corresponding UTF-8 code is algorithmic; it does not depend on
      whether a code point has been assigned or not.

   o  The normalization specified here, NFC (see Section 3), performs a
      very limited set of mappings, much more limited than those of the
      more extensive NFKC used in, e.g., nameprep [RFC3491].  Assuming
      that the Unicode Consortium is consistent with its stated rules
      and does not add any more precomposed characters whose equivalents
      can be built up from combining characters, the NFC tables should
      be completely forward and backward-stable, with no additions or
      changes for characters added in future versions of the Standard.
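
   To illustrate the first point, the following Python sketch encodes
   a code point using only the bit layout given in Section 3 of RFC
   3629.  The function name utf8_encode is hypothetical.  Note that
   the function never consults a table of assigned characters; it
   rejects only values outside the code space and the surrogate range
   that RFC 3629 excludes.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the bit layout of RFC 3629, Section 3.

    Hypothetical helper for illustration only.
    """
    # Reject values outside U+0000..U+10FFFF and the surrogate range.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x20AC).hex())  # U+20AC EURO SIGN
```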

   Were Unicode to be changed in a way that violated these assumptions,
   i.e., that either invalidated the string order of RFC 3629 or that
   required additions or changes to the mappings of NFC, this
   specification would not apply.  Put differently, this specification
   applies only to versions of Unicode starting with version 3.2 and
   extending to, but not including, any version for which those
   conditions do not apply.  This specification therefore applies to
   versions of Unicode through and including the version current when it
   is written, version 4.1.0 [Unicode410].  Subsequent versions would
   require changes to this specification, presumably including string-
   type labeling in all cases.  Where this specification is referenced
   in a specification or implementation, otherwise unidentified UTF-8
   strings are to be treated as conforming to it.


5.  Context of this Proposal

   [[anchor5: RFC Editor: This section to be removed before publication
   if it lasts even that long]]

   There has been some small amount of confusion about the motivation
   for this proposal given that, e.g., MIME and HTTP have their own
   rules about UTF-8 character types.  The answer is that we have
   several protocols that are dependent on either Telnet or other
   arrangements requiring a standard, interoperable, string definition
   without specific content-labels of one sort or another.  In
   particular, if this proposal is approved, or even appears to be
   getting significant traction, it will be immediately followed by a
   Telnet option to specify this type of stream (requiring some special
   provisions for Telnet control codes, of course) and an FTP extension
   to permit a new "Unicode text" data TYPE.


6.  Security Considerations

   This specification provides a standard form for the use of Unicode as
   "network text".  The same security issues that apply to UTF-8, as
   discussed in [RFC3629], apply to it as well, although it should be
   slightly less subject to some risks by virtue of requiring NFC
   normalization and generally being somewhat more restrictive.

   While not specifically a security issue, the requirement in NVT, and
   hence here, that the CR character never appear alone, but only when
   followed by LF (as a line ending) or by ASCII NUL (an octet with all
   bits zero), may be problematic for some programming languages, and
   hence a trap for the unwary, unless caution is used.  This may be an
   additional reason to avoid the use of bare CR entirely, as suggested
   above.

   The discussion about Unicode versions above (Section 4) makes several
   assumptions about future versions of Unicode, about NFC normalization
   being applied properly, and about UTF-8 being processed and
   transmitted exactly as specified [RFC3629].  If any of those
   assumptions are not correct, then there are cases in which strings
   that would be considered equivalent do not compare equal.  Robust
   code should be prepared for those possibilities.
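
   One way a receiver might guard against those possibilities is
   sketched below in Python.  The function name check_net_unicode and
   the wording of its messages are hypothetical.  The sketch also
   assumes that a bare LF is to be reported, which is an
   interpretation of the line-ending rule in Section 2 rather than
   something this document states explicitly.

```python
import unicodedata


def check_net_unicode(octets: bytes) -> list:
    """Return a list of problems found; an empty list means none detected.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding rejects ill-formed UTF-8 (e.g. overlong forms).
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"not well-formed UTF-8: {exc.reason}"]
    problems = []
    if text.startswith("\ufeff"):
        problems.append("BOM signature at start of string")
    if unicodedata.normalize("NFC", text) != text:
        problems.append("text is not NFC-normalized")
    # A CR must be followed by LF or NUL.  Assumption: an LF must be
    # preceded by CR, i.e. bare LF is not accepted as a line ending.
    for i, ch in enumerate(text):
        if ch == "\r" and text[i + 1:i + 2] not in ("\n", "\x00"):
            problems.append(f"bare CR at index {i}")
        elif ch == "\n" and text[i - 1:i] != "\r":
            problems.append(f"bare LF at index {i}")
    return problems


print(check_net_unicode("e\u0301\r\n".encode("utf-8")))
```

   Whether a receiver rejects such input, repairs it, or merely logs
   it is a local policy decision that this sketch does not address.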


7.  Acknowledgments

   Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
   suggestions about Unicode normalization that led to the format
   described here.


8.  References

8.1.  Normative References

   [ISO10646]
              International Organization for Standardization,
              "Information Technology - Universal Multiple-Octet Coded
              Character Set (UCS) - Part 1: Architecture and Basic
              Multilingual Plane", ISO/IEC 10646-1:2000, October 2000.

   [NFC]      Davis, M. and M. Duerst, "Unicode Standard Annex #15:
              Unicode Normalization Forms", March 2005,
              <http://www.unicode.org/reports/tr15/>.

   [RFC0137]  O'Sullivan, T., "Telnet Protocol - a proposed document",
              RFC 137, April 1971.


   [RFC0139]  O'Sullivan, T., "Discussion of Telnet Protocol", RFC 139,
              May 1971.

   [RFC0318]  Postel, J., "Telnet Protocols", RFC 318, April 1972.

   [RFC0854]  Postel, J. and J. Reynolds, "Telnet Protocol
              Specification", STD 8, RFC 854, May 1983.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
              3.0", 2000.

              (Reading, MA, Addison-Wesley, 2000.  ISBN 0-201-61633-5).
              Version 3.2 consists of the definition in that book as
              amended by the Unicode Standard Annex #27: Unicode 3.1
              (http://www.unicode.org/reports/tr27/) and by the Unicode
              Standard Annex #28: Unicode 3.2
              (http://www.unicode.org/reports/tr28/).

   [Unicode410]
              The Unicode Consortium, "The Unicode Standard, Version
              4.1.0", March 2005.

              Defined by: The Unicode Standard, Version 4.0 (Boston, MA,
              Addison-Wesley, 2003.  ISBN 0-321-18578-1), as amended by
              Unicode 4.0.1
              (http://www.unicode.org/versions/Unicode4.0.1) and by
              Unicode 4.1.0
              (http://www.unicode.org/versions/Unicode4.1.0).

8.2.  Informative References

   [ISO.2022.1986]
              International Organization for Standardization,
              "Information Processing: ISO 7-bit and 8-bit coded
              character sets: Code extension techniques", ISO Standard
              2022, 1986.

   [ISO.646.1991]
              International Organization for Standardization,
              "Information technology - ISO 7-bit coded character set
              for information interchange", ISO Standard 646, 1991.


   [ISO.8859.2003]
              International Organization for Standardization,
              "Information processing - 8-bit single-byte coded graphic
              character sets - Part 1: Latin alphabet No. 1 (1998) -
              Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin
              alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4
              (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6:
              Latin/Arabic alphabet (1999) - Part 7: Latin/Greek
              alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) -
              Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin
              alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet
              (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14:
              Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin
              alphabet No. 9 (1999) - Part 16: Latin alphabet
              No. 10 (2001)", ISO Standard 8859, 2003.

   [RFC0020]  Cerf, V., "ASCII format for network interchange", RFC 20,
              October 1969.

   [RFC0097]  Melvin, J. and R. Watson, "First Cut at a Proposed Telnet
              Protocol", RFC 97, February 1971.

   [RFC0698]  Mock, T., "Telnet extended ASCII option", RFC 698,
              July 1975.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [X3.4-1978]
              American National Standards Institute (formerly United
              States of America Standards Institute), "USA Code for
              Information Interchange", ANSI X3.4-1968, 1968.

              ANSI X3.4-1968 has been replaced by newer versions with
              slight modifications, but the 1968 version remains
              definitive for the Internet.


Authors' Addresses

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140
   USA

   Phone: +1 617 491 5735
   Email: john-ietf@jck.com


   Michael A. Padlipsky
   8011 Stewart Ave.
   Los Angeles, CA  90045
   USA

   Phone: +1 310-670-4288
   Email: the.map@alum.mit.edu


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.

