idnits 2.17.1 

draft-klensin-net-utf8-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 18.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 950.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 961.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 968.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 974.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == The 'Obsoletes: ' line in the draft header should list only the
     _numbers_ of the RFCs which will be obsoleted by this document (if
     approved); it should not include the word 'RFC' in the list.

  == The 'Updates: ' line in the draft header should list only the _numbers_
     of the RFCs which will be updated by this document (if approved); it
     should not include the word 'RFC' in the list.

  -- The draft header indicates that this document obsoletes RFC698, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 10, 2008) is 5920 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NFC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode32'

  -- Obsolete informational reference (is this intentional?): RFC  542
     (Obsoleted by RFC 765)

  -- Obsolete informational reference (is this intentional?): RFC  698
     (Obsoleted by RFC 5198)

  -- Obsolete informational reference (is this intentional?): RFC  742
     (Obsoleted by RFC 1194, RFC 1196, RFC 1288)

  -- Obsolete informational reference (is this intentional?): RFC  954
     (Obsoleted by RFC 3912)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 2821
     (Obsoleted by RFC 5321)

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)

  -- Obsolete informational reference (is this intentional?): RFC 3491
     (Obsoleted by RFC 5891)


     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 20 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Klensin
3	Internet-Draft                                              M. Padlipsky
4	Obsoletes: RFC 698                                     February 10, 2008
5	(if approved)
6	Updates: RFC854 (if approved)
7	Intended status: Standards Track
8	Expires: August 13, 2008

10	                 Unicode Format for Network Interchange
11	                     draft-klensin-net-utf8-09.txt

13	Status of this Memo

15	   By submitting this Internet-Draft, each author represents that any
16	   applicable patent or other IPR claims of which he or she is aware
17	   have been or will be disclosed, and any of which he or she becomes
18	   aware will be disclosed, in accordance with Section 6 of BCP 79.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF), its areas, and its working groups.  Note that
22	   other groups may also distribute working documents as Internet-
23	   Drafts.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   The list of current Internet-Drafts can be accessed at
31	   http://www.ietf.org/ietf/1id-abstracts.txt.

33	   The list of Internet-Draft Shadow Directories can be accessed at
34	   http://www.ietf.org/shadow.html.

36	   This Internet-Draft will expire on August 13, 2008.

38	Copyright Notice

40	   Copyright (C) The IETF Trust (2008).

42	Abstract

44	   The Internet today is in need of a standardized form for the
45	   transmission of internationalized "text" information, paralleling the
46	   specifications for the use of ASCII that date from the early days of
47	   the ARPANET.  This document specifies that format, using UTF-8 with
48	   normalization and specific line-ending sequences.

50	Table of Contents

52	   1.  Introduction
53	     1.1.  Requirement for a Standardized Text Stream Format
54	     1.2.  Terminology
55	     1.3.  Mailing List
56	   2.  Net-Unicode Definition
57	   3.  Normalization
58	   4.  Versions of Unicode
59	   5.  Applicability and Stability of this Specification
60	     5.1.  Use in IETF Applications Specifications
61	     5.2.  Unicode Versions and Applicability
62	   6.  Security Considerations
63	   7.  IANA Considerations
64	   8.  Acknowledgments
65	   Appendix A.  History and Context
66	   Appendix B.  The ASCII NVT Definition
67	   Appendix C.  The Line-Ending Problem
68	   Appendix D.  A Note About Related Future Work
69	   Appendix E.  Change log
70	     E.1.  Changes from -00 to -01
71	     E.2.  Changes from -01 to -02
72	     E.3.  Changes from -02 to -03
73	     E.4.  Changes from -03 to -04
74	     E.5.  Changes from -04 to -05
75	     E.6.  Changes from -05 to -07
76	     E.7.  Changes in version -08
77	     E.8.  Changes in version -09
78	   9.  References
79	     9.1.  Normative References
80	     9.2.  Informative References
81	   Authors' Addresses
82	   Intellectual Property and Copyright Statements

84	1.  Introduction

86	1.1.  Requirement for a Standardized Text Stream Format

88	   Historically, Internet protocols have been largely ASCII-based and
89	   references to "text" in protocols have assumed ASCII text and
90	   specifically text in Network Virtual Terminal ("NVT") or "Network
91	   ASCII" form (see Appendix A and Appendix B).  Protocols and formats
92	   that have moved beyond ASCII have included arrangements to
93	   specifically identify the character set and often the language being
94	   used.

96	   In our more internationalized world, "text" clearly no longer equates
97	   unambiguously to "network ASCII".  Fortunately, however, we are
98	   converging on Unicode [Unicode] [ISO10646] as a single international
99	   interchange character coding and no longer need to deal with per-
100	   script standards for character sets (e.g., one standard for each of
101	   Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
102	   languages that are usually considered to share a script, such as
103	   French, German, or Swedish).  Unfortunately, though, while it is
104	   certainly time to define a Unicode-based text type for use as a
105	   common text interchange format, "use Unicode" involves even more
106	   ambiguity than "use ASCII" did decades ago.

108	   Unicode identifies each character by an integer, called its "code
109	   point", in the range 0-0x10ffff.  These integers can be encoded into
110	   byte sequences for transmission in at least three standard and
111	   generally-recognized encoding forms, all of which are completely
112	   defined in The Unicode Standard and the documents cited below:

114	   o  UTF-8 [RFC3629] defines a variable-length encoding that may be
115	      applied uniformly to all code points.

117	   o  UTF-16 [RFC2781] encodes the range of Unicode characters whose
118	      code points are less than 65536 straightforwardly as 16-bit
119	      integers, and provides a "surrogate" mechanism for encoding larger
120	      code points in 32 bits.

122	   o  UTF-32 (also known as UCS-4) simply encodes each code point as a
123	      32-bit integer.

125	   Older forms and nomenclature, such as the 16-bit UCS-2, are now
126	   strongly discouraged.
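
   As an illustration of the three encoding forms listed above, the
   following Python sketch shows the bytes each form produces for a
   single code point.  It is not part of the specification: the helper
   name encoding_forms is hypothetical, and big-endian byte order is
   assumed for UTF-16 and UTF-32 purely to make the output readable.

```python
def encoding_forms(ch):
    """Return the bytes of one character in UTF-8, UTF-16, and UTF-32.

    Big-endian order is an assumption made for display; the text above
    does not prescribe a byte order for UTF-16 or UTF-32.
    """
    return {
        "code point": "U+{:04X}".format(ord(ch)),
        "UTF-8": ch.encode("utf-8").hex(" "),
        "UTF-16BE": ch.encode("utf-16-be").hex(" "),
        "UTF-32BE": ch.encode("utf-32-be").hex(" "),
    }


if __name__ == "__main__":
    # U+0041 fits in one UTF-8 octet; U+1F600 lies above 65535 and so
    # needs a surrogate pair (two 16-bit units) in UTF-16.
    for sample in ("A", "\u00e0", "\U0001F600"):
        print(encoding_forms(sample))
```

   The same code point therefore occupies one to four octets in UTF-8,
   two or four in UTF-16, and always four in UTF-32.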

128	   As with ASCII, any of these forms may be used with different line-
129	   ending conventions.  That flexibility can be an additional source of
130	   confusion with, e.g., index (offset) references into documents based
131	   on character counts.

133	   This document proposes to establish "Net-Unicode" as a new
134	   standardized text transmission form for the Internet, to serve as an
135	   internationalized alternative for NVT ASCII when specified in new --
136	   and, where appropriate, updated -- protocols.  UTF-8 [RFC3629] is
137	   chosen for the coding because it has good compatibility properties
138	   with ASCII and for other reasons discussed in the existing IETF
139	   character set policy [RFC2277].  "Net-Unicode" is specified in
140	   Section 2; the subsequent sections of the document provide background
141	   and explanation.

143	   In circumstances in which there is a choice, use of Unicode and the
144	   text encoding specified here is preferred to the double-byte encoding
145	   of "extended ASCII" [RFC0698] or the assorted per-language or per-
146	   country character coding systems and SHOULD be used.

148	1.2.  Terminology

150	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
151	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
152	   document are to be interpreted as described in [RFC2119].

154	1.3.  Mailing List

156	   [[RFC Editor: Please remove this subsection prior to publication.]]

158	   Along with related work on general internationalization issues, this
159	   document is being discussed on the discuss@apps.ietf.org mailing
160	   list.

162	2.  Net-Unicode Definition

164	   The Network Unicode format (Net-Unicode) is defined as follows:

166	   1.  Characters MUST be encoded in UTF-8 as defined in [RFC3629].

168	   2.  If the protocol has the concept of "lines", line-endings MUST be
169	       indicated by the sequence Carriage-Return (CR, U+000D) followed
170	       by Line-Feed (LF, U+000A), often known just as CRLF.  CR SHOULD
171	       NOT appear except when followed by LF.  The only other context
172	       in which CR is permitted is the combination CR NUL, which is
173	       not recommended (see the note at the end of this
174	       section).

176	   3.  The control characters in the ASCII range (U+0000 to U+001F and
177	       U+007F to U+009F) SHOULD generally be avoided.  CR, LF, and Form
178	       Feed (FF, U+000C) are exceptions to this principle.  However, use
179	       of all but the first requires care as discussed elsewhere in this
180	       document.  The so-called "C1 Controls" (U+0080 through U+009F),
181	       which did not appear in ASCII, MUST NOT appear.

183	       FF should be used only with caution: it does not have a standard
184	       and universal interpretation and, in particular, if its use
185	       assumes a page length, such assumptions may not be appropriate in
186	       international contexts (e.g., considering 8.5x11 inch paper
187	       versus A4).  Other control characters are used to affect display
188	       format, control devices, or to structure files.  None of those
189	       uses is appropriate for streams of plain text.

191	   4.  Before transmission, all character sequences SHOULD be normalized
192	       according to Unicode normalization form "NFC" (see Section 3).

194	   5.  As suggested in Section 6 of RFC 3629, the Byte Order Mark
195	       ("BOM") signature MUST NOT appear at the beginning of these text
196	       strings.

198	   6.  Systems conforming to this specification MUST NOT transmit any
199	       string containing any code point that is unassigned in the
200	       version of Unicode on which they are dependent.  The version of
201	       NFC and the version of Unicode used by that system MUST be
202	       consistent.

204	   The use of LF without CR is questionable; see Appendix B for more
205	   discussion.  The newer control characters IND (U+0084) and NEL ("Next
206	   Line", U+0085) might have been used to disambiguate the various line-
207	   ending situations, but, because their use has not been established on
208	   the Internet, because many protocols require CRLF, and because IND
209	   and NEL fall within the "C1 Controls" group (see above), they MUST
210	   NOT be used.  Similar observations apply to the yet newer line and
211	   paragraph separators at U+2028 and U+2029 and any future characters
212	   that might be defined to serve these functions.  For this
213	   specification and protocols that depend on it, lines end in CRLF and
214	   only in CRLF.  Strings that do not end in CRLF are either not lines
215	   or are not in conformance with this specification.
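
   The "CRLF and only CRLF" rule can be sketched in Python as follows.
   This is an illustration under stated assumptions rather than a
   normative algorithm: the function name split_net_unicode_lines is
   hypothetical, and the input is assumed to be a byte string that has
   already been received in full.

```python
def split_net_unicode_lines(data):
    """Split a byte string into lines terminated by CRLF, and only CRLF.

    Returns (lines, remainder).  A bare LF or bare CR does not end a
    line here; it stays inside the line that contains it.  The
    remainder is whatever follows the last CRLF and is, in the terms
    used above, not (yet) a line.
    """
    parts = data.split(b"\r\n")
    return parts[:-1], parts[-1]


if __name__ == "__main__":
    lines, rest = split_net_unicode_lines(b"one\r\ntwo\nstill two\r\ntail")
    print(lines)  # the bare LF does not split the second line
    print(rest)   # b"tail" has no CRLF terminator
```

   A receiver that wishes to be liberal about bare LF would need
   additional logic; this sketch deliberately has none.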

217	   The NVT specification contained a number of additional provisions,
218	   e.g., for the optional use of backspacing and "bare CR" (sent as CR
219	   NUL) to generate overstruck character sequences.  The much greater
220	   number of precomposed characters in Unicode, the availability of
221	   combining characters, and the growing use of markup conventions of
222	   various types to show, e.g., emphasis (rather than attempting to do
223	   that via the use of special characters), should make such sequences
224	   largely unnecessary.  These sequences SHOULD be avoided if at all
225	   possible.  However, because they were optional in NVT applications
226	   and this specification is an NVT superset, they cannot be prohibited
227	   entirely.  The most important of these rules is that CR MUST NOT
228	   appear unless it is immediately followed by LF (indicating end of
229	   line) or NUL.  Because NUL (an octet whose value is all zeros, i.e.,
230	   %x00 in the notation of [RFC5234]) is hostile to programming
231	   languages that use that character as a string delimiter, the CR NUL
232	   sequence SHOULD be avoided for that reason as well.
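
   The rules of this section can be collected into a rough conformance
   check.  The Python sketch below is an assumption-laden illustration,
   not a reference implementation: the function name
   net_unicode_problems is hypothetical, it reports MUST-level and
   SHOULD-level findings alike, and its notion of "unassigned" is
   whatever the Unicode tables bundled with the running Python
   interpreter say (general category "Cn", which also covers
   noncharacters).

```python
import unicodedata


def net_unicode_problems(data):
    """Return a list of human-readable findings for a byte string.

    An empty list means none of the checks below fired; it is not a
    proof of conformance.
    """
    try:
        text = data.decode("utf-8")  # strict decoding (rule 1)
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: " + exc.reason]

    problems = []
    if text.startswith("\ufeff"):
        problems.append("BOM at start of string")  # rule 5
    for i, ch in enumerate(text):
        cp = ord(ch)
        if ch == "\r":
            following = text[i + 1] if i + 1 < len(text) else ""
            if following not in ("\n", "\x00"):
                problems.append("CR at index %d not followed by LF or NUL" % i)
        elif 0x80 <= cp <= 0x9F:
            problems.append("C1 control U+%04X at index %d" % (cp, i))  # rule 3
        elif unicodedata.category(ch) == "Cn":
            problems.append("unassigned U+%04X at index %d" % (cp, i))  # rule 6
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not in NFC")  # rule 4 (a SHOULD)
    return problems


if __name__ == "__main__":
    print(net_unicode_problems("h\u00e9llo\r\n".encode("utf-8")))
    print(net_unicode_problems("a\u0300 bare\rCR".encode("utf-8")))
```

   The sketch does not flag the other ASCII control characters or a
   bare LF, which the text above discourages without prohibiting.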

234	3.  Normalization

236	   There are cases where strings of Unicode are fundamentally
237	   equivalent, essentially representing the same text.  These are called
238	   "canonical equivalents" in the Unicode Standard.  For example, the
239	   following pairs of strings are canonically equivalent:

241	   U+2126 OHM SIGN
242	   U+03A9 GREEK CAPITAL LETTER OMEGA

244	   U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
245	   U+00E0 LATIN SMALL LETTER A WITH GRAVE

247	   Comparison of strings becomes much easier if any such cases are
248	   always represented by a single unique form.  The Unicode Consortium
249	   specifies a normalization form, known as NFC [NFC], which provides
250	   the necessary mappings and mechanisms to convert all canonically
251	   equivalent sequences to a single unique form.  Typically, this form
252	   produces precomposed characters for any sequences that can be
253	   represented in that fashion.  It also reorders other combining marks
254	   so that they have a unique and unambiguous order.
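
   The two pairs above can be checked with the normalization routine
   in the Python standard library.  This is a minimal sketch; the
   helper name canonically_equal is hypothetical, and the result
   depends on the Unicode tables shipped with the interpreter.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    ohm, omega = "\u2126", "\u03a9"
    decomposed, precomposed = "a\u0300", "\u00e0"
    # The raw strings differ code point by code point ...
    print(ohm == omega, decomposed == precomposed)
    # ... but compare equal once both sides are in NFC.
    print(canonically_equal(ohm, omega))
    print(canonically_equal(decomposed, precomposed))
```

   Comparing the raw strings reports a mismatch in both cases, while
   comparing their NFC forms reports a match.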

256	   Of the various normalization forms defined as part of Unicode, NFC is
257	   closest to actual use in practice, minimizes side-effects due to
258	   considering characters equivalent that may not be equivalent in all
259	   situations, and typically requires the least work when converting
260	   from non-Unicode encodings.

262	   Section 2 requires that, except in very unusual circumstances, all
263	   Net-Unicode strings be transmitted in normalized form.  Recognition
264	   of the fact that some application implementations may rely on
265	   operating system libraries over which they have little control, and
266	   adherence to the robustness principle, suggest that receivers of
267	   such strings should be prepared to receive unnormalized ones and
268	   not to react to them in excessive ways.

270	4.  Versions of Unicode

272	   Unicode changes and expands over time.  Large blocks of space are
273	   reserved for future expansion.  New versions, which appear at regular
274	   intervals, add new scripts and characters.  Occasionally they also
275	   change some property definitions.  In retrospect, one of the
276	   advantages of ASCII [X3.4-1968] when it was chosen was that the code
277	   space was full when the Standard was first published.  There was no
278	   practical way to add characters or change code point assignments
279	   without being obviously incompatible.

281	   While there are some security issues if people deliberately try to
282	   trick the system (see Section 6), Unicode version changes should not
283	   have a significant impact on the text stream specification of this
284	   document for the following reasons:

286	   o  The transformation between Unicode code table positions and the
287	      corresponding UTF-8 code is algorithmic; it does not depend on
288	      whether a code point has been assigned or not.

290	   o  The normalization recommended here, NFC (see Section 3), performs
291	      a very limited set of mappings, much more limited than those of
292	      the more extensive NFKC used in, e.g., Nameprep [RFC3491].
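
   To illustrate the first point, the following Python sketch encodes
   a code point to UTF-8 using only integer arithmetic, with no
   reference to any table of assigned characters.  The function name
   utf8_encode is hypothetical, and the bit layout follows the general
   scheme described in RFC 3629; treat it as an illustration rather
   than a vetted encoder.

```python
def utf8_encode(cp):
    """Encode one code point (an int) as UTF-8 bytes, arithmetically."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    # U+0378 is assumed here to be unassigned in the reader's version
    # of Unicode; the encoding step neither knows nor cares.
    print(utf8_encode(0x0378).hex(" "))
```

   Whether a code point is assigned matters for normalization and for
   rule 6 of Section 2, but not for the byte-level encoding.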

294	   The NFC tables may be updated over time as new characters are added,
295	   but the Unicode Consortium has guaranteed the stability of all NFC
296	   strings.  That is, if a string does not contain any unassigned
297	   characters, and it is normalized according to NFC, it will always be
298	   normalized according to all future versions of the Unicode Standard.
299	   The stability of the Net-Unicode format is thus guaranteed when any
300	   implementation that converts text into Net-Unicode format does not
301	   permit unassigned characters.

303	   Because Unicode code points that are reserved for private use do not
304	   have standard definitions or normalization interpretations, they
305	   SHOULD be avoided in strings intended for Internet interchange.

307	   Were Unicode to be changed in a way that violated these assumptions,
308	   i.e., that either invalidated the byte string order specified in RFC
309	   3629 or that changed the stability of NFC as stated above, this
310	   specification would not apply.  Put differently, this specification
311	   applies only to versions of Unicode starting with version 5.0 and
312	   extending to, but not including, any version for which changes are
313	   made in either the UTF-8 definition or to NFC stability.  Such
314	   changes would violate established Unicode policies and are hence
315	   unlikely, but, should they occur, it would be necessary to evaluate
316	   them for compatibility with this specification and other Internet
317	   uses of NFC.

319	   If the specification of a protocol references this one, strings that
320	   are received by that protocol and that appear to be UTF-8 and are not
321	   otherwise identified (e.g., by charset labeling) SHOULD be treated as
322	   using UTF-8 in conformance with this specification.

324	5.  Applicability and Stability of this Specification

326	5.1.  Use in IETF Applications Specifications

328	   During the development of this specification, there was some
329	   confusion about where it would be useful given that, e.g., the
330	   individual MIME media types used in email and with HTTP have their
331	   own rules about UTF-8 character types and normalization and the
332	   application transport protocols impose their own conventions about
333	   line endings.  There are three answers.  The first is that, in
334	   retrospect, it would have been better to have those protocols and
335	   content types standardized in the way specified here, even though it
336	   is certainly too late to change them at this time.  The second is
337	   that we have several protocols that are dependent on either the
338	   original Telnet design or other arrangements requiring a standard,
339	   interoperable, string definition without specific content-labels of
340	   one sort or another.  Whois [RFC3912] is an example member of this
341	   group.  As consideration is given to upgrading them for non-ASCII
342	   use, this specification provides a normative reference that provides
343	   the same stability that NVT has provided the ASCII forms.  This
344	   specification is intended for use by other specifications that have
345	   not yet defined how to use Unicode.  Having a preferred standard
346	   Internet definition for Unicode text streams -- rather than just one
347	   for transmission codings -- may help improve the specification and
348	   interoperability of protocols to be developed in the future.  This
349	   specification is not intended for use with specifications that
350	   already allow the use of UTF-8 and precisely define that use.

352	5.2.  Unicode Versions and Applicability

354	   The IETF faces a practical dilemma with regard to versions of
355	   Unicode.  Each new version brings with it new characters and
356	   sometimes new combining characters.  Version 5.0 introduces the new
357	   concept of sequences of characters named as if they were individual
358	   characters (see [NamedSequences]).  The normalization represented by
359	   NFC is stable if all strings are transmitted and stored in normalized
360	   form, if corrections are never made to character definitions or
361	   normalization tables, and if unassigned code points are never used.
362	   The latter is important because an unassigned code point always
363	   normalizes to itself.  However, if the same code point is assigned to
364	   a character in a future version, it may participate in some other
365	   normalization mapping (some specific difficulties in this regard are
366	   discussed in [RFC4690]).  It is worth noting that transmission in
367	   normalized form is not required by either the IETF's UTF-8 Standard
368	   [RFC3629] or by standards dependent on the current version of
369	   Stringprep [RFC3454].

371	   All would be well with this as described in Section 4 except for one
372	   problem: Applications typically do not perform their own conversions
373	   to Unicode and may not perform their own normalizations but instead
374	   rely on operating system or language library functions -- functions
375	   that may be upgraded or otherwise changed without changes to the
376	   application code itself.  Consequently, there may be no plausible way
377	   for an application to know which version of Unicode, or which version
378	   of the normalization procedures, it is utilizing, nor is there any
379	   way by which it can guarantee that the two will be consistent.

381	   Because of per-version changes in definitions and tables, Stringprep
382	   and documents depending on it are now tied to Unicode Version 3.2
383	   [Unicode32], and full interoperability of Internet Standard UTF-8
384	   [RFC3629], when used with normalization as specified here, is
385	   dependent on normalization definitions and the definition of UTF-8
386	   itself not changing after Unicode Version 5.0.  These assumptions
387	   seem fairly safe, but they are still assumptions.  Rather than being
388	   linked to the latest available version of Unicode, version 5.0
389	   [Unicode], or broader concepts of version independence based on
390	   specific assumptions and conditions, this specification could
391	   reasonably have been tied, like Stringprep and Nameprep, to Unicode
392	   3.2 [Unicode32] or some more recent intermediate version, but, in
393	   addition to the obvious disadvantages of having different IETF
394	   standards tied to different versions of Unicode, the library-based
395	   application implementation behavior described above makes these
396	   version linkages nearly meaningless in practice.

398	   In theory, one can get around this problem in four ways:

400	   1.  Freeze on a particular version of Unicode and try to insist that
401	       applications enforce that version by, e.g., containing lists of
402	       unassigned characters and prohibiting their use.  Of course, this
403	       would prohibit evolution to include newly-added scripts and the
404	       tables of unassigned code points would be cumbersome.

406	   2.  Require that every Unicode "text" string or file start with a
407	       version indication, somewhat akin to the "byte order mark"
408	       indicator.  It is unlikely that this provision would be
409	       practical.  More important, it would require that each
410	       application implementation be prepared to either support multiple
411	       normalization tables and versions or that it reject text from
412	       Unicode Versions with which it was not prepared to deal.

414	   3.  Devise a different set of normalization rules that would, e.g.,
415	       guarantee that no character assigned to a previously-unassigned
416	       code point in Unicode was ever normalized to anything but itself
417	       and use those rules instead of NFC.  It is not clear whether or
418	       not such a set of rules is possible or whether some other
419	       completely stable set of rules could be devised, perhaps in
420	       combination with restrictions on the ways in which characters
421	       were added in future versions of Unicode.

423	   4.  Devise a normalization process that is otherwise equivalent to
424	       NFC but that rejects code points that are unassigned in the
425	       current version of Unicode, rather than mapping those code points
426	       to themselves.  This would still leave some risk of incompatible
427	       corrections in Unicode and possibly a few edge cases, but it is
428	       probably stable enough for Internet use in the overwhelming
429	       number of cases.  This process has been discussed in the Unicode
430	       Consortium under the name "Stable NFC".

432	   None of these approaches seems ideal: the ideal procedure would be as
433	   stable and predictable as ASCII has been.  But that level is simply
434	   not feasible as long as Unicode continues to evolve by the addition
435	   of new code points and scripts.  The fourth option listed above
436	   appears to be a reasonable compromise.
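
   The fourth option can be approximated with standard library
   facilities.  The Python sketch below is only an illustration of the
   idea described above, not the Unicode Consortium's definition of
   "Stable NFC": the function name stable_nfc is hypothetical, and
   "unassigned" is taken to mean general category "Cn" in the tables
   bundled with the running interpreter (reported by
   unicodedata.unidata_version), which also includes noncharacters.

```python
import unicodedata


def stable_nfc(text):
    """Normalize to NFC, but reject unassigned code points.

    Plain NFC maps an unassigned code point to itself; this variant
    raises ValueError instead, so that a later Unicode version cannot
    silently change how the same string would normalize.
    """
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                "unassigned code point U+%04X at index %d (Unicode %s)"
                % (ord(ch), index, unicodedata.unidata_version)
            )
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(stable_nfc("a\u0300") == "\u00e0")
    try:
        stable_nfc("abc\u0378")  # U+0378: assumed unassigned
    except ValueError as exc:
        print("rejected:", exc)
```

   Because the check depends on the library's Unicode version, two
   systems with different libraries may still disagree about which
   strings are acceptable, which is the problem described earlier in
   this section.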

438	6.  Security Considerations

440	   This specification provides a standard form for the use of Unicode as
441	   "network text".  Most of the same security issues that apply to
442	   UTF-8, as discussed in [RFC3629], apply to it, although it should be
443	   slightly less subject to some risks by virtue of requiring NFC
444	   normalization and generally being somewhat more restrictive.
445	   However, shifts in Unicode versions, as discussed in Section 5.2, may
446	   introduce other security issues.

448	   Programs that receive these streams should use extreme caution about
449	   assuming that incoming data are normalized, since it might be
450	   possible to use unnormalized forms, as well as invalid UTF-8, as part
451	   of an attack.  In particular, firewalls and other systems that
452	   interpret UTF-8 streams should be developed with the clear knowledge
453	   that an attacker may deliberately send unnormalized text, for
454	   instance to avoid detection by naive text-matching systems.
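
   The evasion described above is easy to reproduce.  In this Python
   sketch, the function names and the sample strings are hypothetical;
   it shows only that a filter comparing raw code points can miss a
   canonically equivalent spelling, and that normalizing both sides
   first closes that particular gap.

```python
import unicodedata


def naive_contains(haystack, needle):
    """Substring match on raw code points, with no normalization."""
    return needle in haystack


def normalized_contains(haystack, needle):
    """Substring match after converting both strings to NFC."""
    nfc_haystack = unicodedata.normalize("NFC", haystack)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_needle in nfc_haystack


if __name__ == "__main__":
    blocked = "caf\u00e9"          # precomposed e with acute accent
    incoming = "cafe\u0301 menu"   # same text, sent decomposed
    print(naive_contains(incoming, blocked))       # the naive filter misses it
    print(normalized_contains(incoming, blocked))  # the normalizing filter finds it
```

   Normalizing before matching addresses canonical equivalence only;
   it does nothing about invalid UTF-8 or visually confusable
   characters.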

456	   NVT contains a requirement, of necessity repeated here (see
457	   Section 2), that the CR character be immediately followed by either
458	   LF or ASCII NUL (an octet with all bits zero).  NUL may be
459	   problematic for some programming languages that use it as a string
460	   terminator, and hence a trap for the unwary, unless caution is used.
461	   This may be an additional reason to avoid the use of CR entirely,
462	   except in sequence with LF, as suggested above.

464	   The discussion about Unicode versions above (see Section 4 and
465	   Section 5.2) makes several assumptions about future versions of
466	   Unicode, about NFC normalization being applied properly, and about
467	   UTF-8 being processed and transmitted exactly as specified in RFC
468	   3629.  If any of those assumptions are not correct, then there are
469	   cases in which strings that would be considered equivalent do not
470	   compare equal.  Robust code should be prepared for those
471	   possibilities.

473	7.  IANA Considerations

475	   [[RFC Editor: Please remove this useless subsection prior to
476	   publication.]]

478	   This specification requires no actions of any type from the IANA.

480	8.  Acknowledgments

482	   Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
483	   suggestions about Unicode normalization that led to the format
484	   described here and especially to Mark for providing the paragraphs
485	   that describe the role of NFC.  Thanks also to Mark, Doug Ewell,
486	   and Asmus Freytag for corrected text describing Unicode transmission
487	   forms and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin
488	   Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern
489	   Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George
490	   Michaelson, Chris Newman, and Marcos Sanz for a number of helpful
491	   comments and clarification requests.

493	Appendix A.  History and Context

495	   This appendix contains a review of prior work in the ARPANET and
496	   Internet to establish a standard text type, work that establishes the
497	   context and motivation for the approach taken in this document.  The
498	   text is explanatory rather than normative: nothing in this section is
499	   intended to change or update any current specification.  Those who
500	   are uninterested in this review and analysis can safely skip this
501	   section.

503	   One of the earlier application design decisions made in the
504	   development of ARPANET, a decision that was carried forward into the
505	   Internet, was the decision to standardize on a single and very
506	   specific coding for "text" to be passed across the network [RFC0020].
507	   Hosts on the network were then responsible for translating or mapping
508	   from whatever character coding conventions were used locally to that
509	   common intermediate representation, with sending hosts mapping to it
510	   and receiving ones mapping from it to their local forms as needed.
511	   It is interesting to note that at the time the ARPANET was being
512	   developed, participating host operating systems used at least three
513	   different character coding standards: the antiquated BCD (Binary
514	   Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
515	   (Extended BCD Interchange Code), and the then-still emerging ASCII
516	   (American Standard Code for Information Interchange).  Since the
517	   ARPANET was an "open" project and EBCDIC was intimately linked to a
518	   particular hardware vendor, the original Network Working Group agreed
519	   that its standard should be ASCII.  That ASCII form was precisely
520	   "7-bit ASCII in an 8-bit field", which was in effect a compromise
521	   between hosts that were natively 7-bit oriented (e.g., with five
522	   seven-bit characters in a 36 bit word), those that were 8-bit
523	   oriented (using eight-bit characters) and those that placed the
524	   seven-bit ASCII characters in 9-bit fields with two leading zero bits
525	   (four characters in a 36 bit word).

527	   More standardization was suggested in the first preliminary
528	   description of the Telnet protocol [RFC0097].  With the iterations of
529	   that protocol [RFC0137] [RFC0139] and the drawing together of an
530	   essentially formal definition somewhat later [RFC0318], a standard
531	   abstraction, the Network Virtual Terminal (NVT) was established.  NVT
532	   character-coding conventions (initially called "Telnet ASCII" and
533	   later called "NVT ASCII", or, more casually, "network ASCII")
534	   included the requirement that Carriage Return followed by Line Feed
535	   (CRLF) be the common representation for ending lines of text (given
536	   that some participating "Host" operating systems used the one
537	   natively, some the other, at least one used both, and a few used
538	   neither (preferring variable-length lines with counts or special
539	   delimiters or markers instead)) and specified conventions for some
540	   other characters.  Also, since NVT ASCII was restricted to seven-bit
541	   characters, use of the high-order bit in octets was reserved for the
542	   transmission of control signaling information.

544	   At a very high level, the concept was that a system could use
545	   whatever character coding and line representations were appropriate
546	   locally, but text transmitted over the network as text must conform
547	   to the single "network virtual terminal" convention.  Virtually all
548	   early Internet protocols that presume transfer of "text" assume this
549	   virtual terminal model, although different ones assume or limit it in
550	   different ways.  Telnet, the command stream and ASCII Type in FTP
551	   [RFC0542], the message stream in SMTP transfer [RFC2821], and the
552	   strings passed to finger [RFC0742] and whois [RFC0954] are the
553	   classic examples.  More recently, HTTP [RFC1945] [RFC2616] follows
554	   the same general model but permits 8 bit data and leaves the line end
555	   sequence unspecified (the latter has been the source of a significant
556	   number of problems).

558	Appendix B.  The ASCII NVT Definition

560	   The main body of this specification is intended as an update to, and
561	   internationalized version of, the Net-ASCII definition.  The
562	   specification is self-contained in that parts of the Net-ASCII
563	   definition that are no longer recommended are not included above.
564	   Because Net-ASCII evolved somewhat over time and there has been
565	   debate about which specification is the "official" Net-ASCII, it is
566	   appropriate to review the key elements of that definition here.  This
567	   review is informal with regard to the contents of Net-ASCII and
568	   should not be considered as a normative update or summary of the
569	   earlier specifications (Section 2 does specify some normative updates
570	   to those specifications and some comments below are consistent with
571	   it).

573	   The first part of the section titled "THE NVT PRINTER AND KEYBOARD"
574	   in RFC 854 [RFC0854] is generally, although not universally,
575	   considered to be the normative definition of the (ASCII) Network
576	   Virtual Terminal and hence of Net-ASCII.  It includes not only the
577	   graphic ASCII characters but a number of control characters.  The
578	   latter are given Internet-specific meanings that are often more
579	   specific than the definitions in the ASCII specification.  In today's
580	   usage, and for the present specification, the following
581	   clarifications and updates to that list should be noted.  Each one is
582	   accompanied by a brief explanation of the reason why the original
583	   specification is no longer appropriate.

585	   1.  The "defined but not required" codes -- BEL (U+0007), BS
586	       (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the
587	       undefined control codes ("C0") SHOULD NOT be used unless required
588	       by exceptional circumstances.  Either their original "network
589	       printer" definitions are no longer in general use, common
590	       practice has evolved away from the formats specified there, or
591	       their use to simulate characters that are better handled by
592	       Unicode is no longer appropriate.  While the appearance of some
593	       of these characters on the list may seem surprising, BS now has
594	       an ambiguous interpretation in practice (erasing in some systems
595	       but not in others), the width associated with HT varies with the
596	       environment, and VT and FF do not have a uniform effect with
597	       regard to either vertical positioning or the associated
598	       horizontal position result.  Of course, telnet escapes are not
599	       considered part of the data stream and hence are unaffected by
600	       this provision.

602	   2.  In Net-ASCII, CR MUST NOT appear except when immediately followed
603	       by either NUL or LF, with the latter (CR LF) designating the "new
604	       line" function.  Today and as specified above, CR should
605	       generally appear only when followed by LF.  Because page layout
606	       is better done in other ways, because NUL has a special
607	       interpretation in some programming languages, and to avoid other
608	       types of confusion, CR NUL should preferably be avoided as
609	       specified above.

611	   3.  LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
612	       sequences (e.g., CR LF CR LF).

614	   4.  The historical NVT documents do not call out either "bare LF" (LF
615	       without CR) or HT for special treatment.  Both have generally
616	       been understood to be problematic.  In the case of LF, there is a
617	       difference in interpretation as to whether its semantics imply
618	       "go to same position on the next line" or "go to the first
619	       position on the next line" and interoperability considerations
620	       suggest not depending on which interpretation the receiver
621	       applies.  At the same time, misinterpretation of LF is less
622	       harmful than misinterpretation of "bare" CR: in the CR case, text
623	       may be erased or made completely unreadable; in the LF one, the
624	       worst consequence is a very funny-looking display.  Obviously, HT
625	       is problematic because there is no standard way to transmit
626	       intended tab position or width information in running text.
627	       Again, the harm is unlikely to be great if HT is simply
628	       interpreted as one or more spaces, but, in general, it cannot be
629	       relied upon to format information.
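
	   The points in the list above can be approximated by a simple scan
	   over the octets of a received string.  The Python sketch below is
	   illustrative only: the function name, the labels it reports, and
	   the choice to report CR NUL separately are assumptions made for
	   this example, not requirements of this specification.

```python
# Control codes from item 1 that are discouraged in running text.
DISCOURAGED = {0x07: "BEL", 0x08: "BS", 0x09: "HT", 0x0B: "VT", 0x0C: "FF"}


def find_line_ending_problems(data: bytes):
    """Return (offset, label) pairs for questionable octets in data.

    Hypothetical helper: flags CR NUL, bare CR, bare LF, and the
    discouraged control codes.  CR LF is accepted silently.
    """
    problems = []
    for i, octet in enumerate(data):
        if octet == 0x0D:  # CR
            nxt = data[i + 1] if i + 1 < len(data) else None
            if nxt == 0x00:
                problems.append((i, "CR NUL"))
            elif nxt != 0x0A:
                problems.append((i, "bare CR"))
        elif octet == 0x0A:  # LF
            if i == 0 or data[i - 1] != 0x0D:
                problems.append((i, "bare LF"))
        elif octet in DISCOURAGED:
            problems.append((i, DISCOURAGED[octet]))
    return problems


print(find_line_ending_problems(b"first\r\nsecond\r\n"))  # []
print(find_line_ending_problems(b"a\rb\nc\td"))
# [(1, 'bare CR'), (3, 'bare LF'), (5, 'HT')]
```

	   A receiver might log such findings rather than reject the text,
	   since several of the items above are recommendations only.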

631	   It is worth noting that the telnet IAC character (an octet consisting
632	   of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that
633	   particular octet cannot appear in a valid UTF-8 string.  However,
634	   while few of them have been used, telnet permits other command-
635	   introducer characters whose bit sequences in an octet may be part of
636	   valid UTF-8 characters.  While it causes no ambiguity in UTF-8,
637	   Unicode assigns a graphic character ("Latin Small Letter Y with
638	   Diaeresis") to U+00FF (octets C3 BF in UTF-8).  Some caution is
639	   clearly in order in this area.
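
	   The observation about the octet %xFF can be checked with a short
	   Python sketch.  The helper name is_valid_utf8 is hypothetical, and
	   the example relies on the strict UTF-8 decoder in the Python
	   standard library; it illustrates the point and is not a Telnet
	   implementation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00FF is a graphic character, but its UTF-8 form contains no FF octet.
y_diaeresis = "\u00ff".encode("utf-8")
print(y_diaeresis.hex())    # c3bf
print(0xFF in y_diaeresis)  # False

# A raw FF octet (the Telnet IAC value) never forms valid UTF-8.
print(is_valid_utf8(b"abc\xff"))  # False

# Other high octets, such as F0, do occur in valid UTF-8 sequences.
emoji = "\U0001F600".encode("utf-8")
print(emoji.hex())          # f09f9880
print(0xF0 in emoji)        # True
```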

641	Appendix C.  The Line-Ending Problem

643	   The definition of how a line ending should be denoted in plain text
644	   strings on the wire for the Internet has been controversial from even
645	   before the introduction of NVT.  Some have argued that recipients
646	   should be required to interpret almost anything that a sender might
647	   intend as a line ending as actually a line ending.  Others have
648	   pointed out that this would lead to some ambiguities of
649	   interpretation and presentation and would violate the principle that
650	   we should minimize the number of forms that are permitted on the wire
651	   in order to promote interoperability and eliminate the "every
652	   recipient needs to understand every sender format" problem.  The
653	   design of this specification, like that of NVT, takes the latter
654	   approach.  Its designers believe that there is little point in a
655	   standard if it is to specify "anyone can do whatever they like and
656	   the receiver just needs to cope".

658	   A further discussion of the nature and evolution of the line-ending
659	   problem appears in Section 5.8 of the Unicode Standard [Unicode] and
660	   is suggested for additional reading.  If we were starting with the
661	   Internet today, it would probably be sensible to follow the
662	   recommendation there and use LS (U+2028) exclusively, in preference
663	   to CRLF.  However, the installed base of use of CRLF and the
664	   importance of forward compatibility with NVT and protocols that
665	   assume it make that impossible, so it is necessary to continue using
666	   CRLF as the "New Line Function" ("NLF", see the terminology section
667	   in that reference) discussed there.
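
	   The difference between the two approaches can be seen in a small
	   Python sketch.  It is illustrative only: the sample text is
	   invented, and the comparison is between a split on CRLF alone and
	   the more liberal splitlines method of the Python standard library,
	   which also treats LS (U+2028) and several other characters as line
	   boundaries.

```python
text = "one\r\ntwo\u2028still two\r\nthree"

# Strict reading: only CRLF ends a line.
strict = text.split("\r\n")
print(len(strict))   # 3

# Liberal reading: str.splitlines() also breaks at U+2028.
liberal = text.splitlines()
print(len(liberal))  # 4
```

	   The two readings disagree about the number of lines in the same
	   string, which is the kind of ambiguity a single on-the-wire form
	   is meant to avoid.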

669	Appendix D.  A Note About Related Future Work

671	   Once this proposal is approved, consideration should be given to a
672	   Telnet (or SSH [RFC4251]) option to specify this type of stream and
673	   an FTP extension [RFC0959] to permit a new "Unicode text" data TYPE.

675	Appendix E.  Change log

677	   [[ RFC Editor: Please remove this section before publication. ]]

679	E.1.  Changes from -00 to -01

681	   o  Replaced the section on Normalization with text provided by Mark
682	      Davis

684	   o  Several small editorial changes and corrections.

686	E.2.  Changes from -01 to -02

688	   o  Added material explaining the relationship to Net-ASCII and the
689	      NVT.

691	   o  Brought the material on transmission forms into line with current
692	      practice and terminology.

694	   o  Made terminology more consistent.

696	   o  Inserted normalization text provided by Mark Davis.

698	   o  Rewrote and reorganized Unicode versioning material.

700	   o  Clarified relationships to existing protocols, stressing that this
701	      is not, in itself, a proposal to change any of them.

703	E.3.  Changes from -02 to -03

705	   o  Clarification of several relationships and updating to reflect
706	      mailing list comments and other work.

708	   o  Inserted a discussion and pair of placeholders about prohibited
709	      NVT characters.

711	   o  Several corrections of typographic and editorial errors and
712	      additions of relevant references.

714	E.4.  Changes from -03 to -04

716	   o  Reduced requirement for NFC on transmission to a SHOULD, per on-
717	      list discussion and the realization that receivers cannot safely
718	      assume that normalization was applied.

720	   o  Rewrote the discussion of Net-ASCII to separate changes for Net-
721	      Unicode from the original model, rewrote the description of the
722	      latter, and moved most background/historical material to
723	      appendices, as suggested by Chris Newman and others.

725	   o  Several small editorial improvements, including those suggested in
726	      a March note from Chris Newman.

728	   o  Removed remaining editorial/work-in-progress notes.

730	E.5.  Changes from -04 to -05

732	   o  Additions to Security Considerations and elsewhere for
733	      unnormalized text.

735	   o  Discussion of FormFeed (FF) rewritten and rationale provided.

737	   o  Added preliminary "updates" and "obsoletes" indications.

739	   o  Significant rewrites, and some text moved to a new appendix,
740	      responding to comments from Martin Duerst.

742	   o  Several small editorial / typographical corrections.

744	E.6.  Changes from -05 to -07

746	   Versions -06 and -07 included a number of editorial improvements, plus
747	   additional discussion of characters that were included or excluded,
748	   especially characters that end lines or set the position of the next
749	   character to be displayed.

751	   Version -07 became the version announced for IETF Last Call.

753	E.7.  Changes in version -08

755	   These changes were made subsequent to IETF Last Call and in response
756	   to Last Call comments.

758	   o  Added a useless IANA Considerations section so that the RFC Editor
759	      can remove it (at the same time this log is removed).

761	   o  Clarified that, if the relevant protocol doesn't have a concept of
762	      "line", the material in Section 2, bullet 2 is irrelevant.

764	   o  Rearranged some text, modified section titles, and elaborated on
765	      the comment at the end of Appendix C to improve clarity.

767	   o  Added an additional discussion of Line Ending issues as an
768	      appendix (Appendix C with renumbering).

770	   o  Several editorial corrections and clarifications, including
771	      reference updates.

773	E.8.  Changes in version -09

775	   Some additional editorial changes that weren't picked up for -08.

777	9.  References

779	9.1.  Normative References

781	   [ISO10646]
782	              International Organization for Standardization,
783	              "Information Technology - Universal Multiple-Octet Coded
784	              Character Set (UCS)", ISO/IEC 10646:2003 (with
785	              amendments), 2003.

787	   [NFC]      Davis, M. and M. Duerst, "Unicode Standard Annex #15:
788	              Unicode Normalization Forms", October 2006,
789	              <http://www.unicode.org/reports/tr15/>.

791	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
792	              Requirement Levels", BCP 14, RFC 2119, March 1997.

794	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
795	              10646", STD 63, RFC 3629, November 2003.

797	   [RFC5234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
798	              Specifications: ABNF", STD 68, RFC 5234, January 2008.

800	   [Unicode]  The Unicode Consortium, "The Unicode Standard, Version
801	              5.0", 2007.

803	              Boston, MA, USA: Addison-Wesley.  ISBN 0-321-48091-0

805	   [Unicode32]
806	              The Unicode Consortium, "The Unicode Standard, Version
807	              3.0", 2000.

809	              (Reading, MA, Addison-Wesley, 2000.  ISBN 0-201-61633-5).
810	              Version 3.2 consists of the definition in that book as
811	              amended by the Unicode Standard Annex #27: Unicode 3.1
812	              (http://www.unicode.org/reports/tr27/) and by the Unicode
813	              Standard Annex #28: Unicode 3.2
814	              (http://www.unicode.org/reports/tr28/).

816	9.2.  Informative References

818	   [ISO.646.1991]
819	              International Organization for Standardization,
820	              "Information technology - ISO 7-bit coded character set
821	              for information interchange", ISO Standard 646, 1991.

823	   [ISO.8859.2003]
824	              International Organization for Standardization,
825	              "Information processing - 8-bit single-byte coded graphic
826	              character sets - Part 1: Latin alphabet No. 1 (1998) -
827	              Part 2: Latin alphabet No. 2 (1999) - Part 3: Latin
828	              alphabet No. 3 (1999) - Part 4: Latin alphabet No. 4
829	              (1998) - Part 5: Latin/Cyrillic alphabet (1999) - Part 6:
830	              Latin/Arabic alphabet (1999) - Part 7: Latin/Greek
831	              alphabet (2003) - Part 8: Latin/Hebrew alphabet (1999) -
832	              Part 9: Latin alphabet No. 5 (1999) - Part 10: Latin
833	              alphabet No. 6 (1998) - Part 11: Latin/Thai alphabet
834	              (2001) - Part 13: Latin alphabet No. 7 (1998) - Part 14:
835	              Latin alphabet No. 8 (Celtic) (1998) - Part 15: Latin
836	              alphabet No. 9 (1999) - Part 16: Latin alphabet
837	              No. 10 (2001)", ISO Standard 8859, 2003.

839	   [NamedSequences]
840	              The Unicode Consortium, "NamedSequences-4.1.0.txt", 2005,
841	              <http://www.unicode.org/Public/UNIDATA/
842	              NamedSequences.txt>.

844	   [RFC0020]  Cerf, V., "ASCII format for network interchange", RFC 20,
845	              October 1969.

847	   [RFC0097]  Melvin, J. and R. Watson, "First Cut at a Proposed Telnet
848	              Protocol", RFC 97, February 1971.

850	   [RFC0137]  O'Sullivan, T., "Telnet Protocol - a proposed document",
851	              RFC 137, April 1971.

853	   [RFC0139]  O'Sullivan, T., "Discussion of Telnet Protocol", RFC 139,
854	              May 1971.

856	   [RFC0318]  Postel, J., "Telnet Protocols", RFC 318, April 1972.

858	   [RFC0542]  Neigus, N., "File Transfer Protocol", RFC 542,
859	              August 1973.

861	   [RFC0698]  Mock, T., "Telnet extended ASCII option", RFC 698,
862	              July 1975.

864	   [RFC0742]  Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
865	              December 1977.

867	   [RFC0854]  Postel, J. and J. Reynolds, "Telnet Protocol
868	              Specification", STD 8, RFC 854, May 1983.

870	   [RFC0954]  Harrenstien, K., Stahl, M., and E. Feinler, "NICNAME/
871	              WHOIS", RFC 954, October 1985.

873	   [RFC0959]  Postel, J. and J. Reynolds, "File Transfer Protocol",
874	              STD 9, RFC 959, October 1985.

876	   [RFC1945]  Berners-Lee, T., Fielding, R., and H. Nielsen, "Hypertext
877	              Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996.

879	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
880	              Languages", BCP 18, RFC 2277, January 1998.

882	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
883	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
884	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

886	   [RFC2781]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO
887	              10646", RFC 2781, February 2000.

889	   [RFC2821]  Klensin, J., "Simple Mail Transfer Protocol", RFC 2821,
890	              April 2001.

892	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
893	              Internationalized Strings ("stringprep")", RFC 3454,
894	              December 2002.

896	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
897	              Profile for Internationalized Domain Names (IDN)",
898	              RFC 3491, March 2003.

900	   [RFC3912]  Daigle, L., "WHOIS Protocol Specification", RFC 3912,
901	              September 2004.

903	   [RFC4251]  Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
904	              Protocol Architecture", RFC 4251, January 2006.

906	   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
907	              Recommendations for Internationalized Domain Names
908	              (IDNs)", RFC 4690, September 2006.

910	   [X3.4-1968]
911	              American National Standards Institute (formerly United
912	              States of America Standards Institute), "USA Code for
913	              Information Interchange", ANSI X3.4-1968, 1968.

915	              ANSI X3.4-1968 has been replaced by newer versions with
916	              slight modifications, but the 1968 version remains
917	              definitive for the Internet.

919	Authors' Addresses

921	   John C Klensin
922	   1770 Massachusetts Ave, #322
923	   Cambridge, MA  02140
924	   USA

926	   Phone: +1 617 491 5735
927	   Email: john-ietf@jck.com
928	   Michael A. Padlipsky
929	   8011 Stewart Ave.
930	   Los Angeles, CA  90045
931	   USA

933	   Phone: +1 310-670-4288
934	   Email: the.map@alum.mit.edu

936	Full Copyright Statement

938	   Copyright (C) The IETF Trust (2008).

940	   This document is subject to the rights, licenses and restrictions
941	   contained in BCP 78, and except as set forth therein, the authors
942	   retain all their rights.

944	   This document and the information contained herein are provided on an
945	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
946	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
947	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
948	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
949	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
950	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

952	Intellectual Property

954	   The IETF takes no position regarding the validity or scope of any
955	   Intellectual Property Rights or other rights that might be claimed to
956	   pertain to the implementation or use of the technology described in
957	   this document or the extent to which any license under such rights
958	   might or might not be available; nor does it represent that it has
959	   made any independent effort to identify any such rights.  Information
960	   on the procedures with respect to rights in RFC documents can be
961	   found in BCP 78 and BCP 79.

963	   Copies of IPR disclosures made to the IETF Secretariat and any
964	   assurances of licenses to be made available, or the result of an
965	   attempt made to obtain a general license or permission for the use of
966	   such proprietary rights by implementers or users of this
967	   specification can be obtained from the IETF on-line IPR repository at
968	   http://www.ietf.org/ipr.

970	   The IETF invites any interested party to bring to its attention any
971	   copyrights, patents or patent applications, or other proprietary
972	   rights that may cover technology that may be required to implement
973	   this standard.  Please address the information to the IETF at
974	   ietf-ipr@ietf.org.

976	Acknowledgment

978	   Funding for the RFC Editor function is provided by the IETF
979	   Administrative Support Activity (IASA).