idnits 2.17.1 

draft-shafranovich-rfc4180-bis-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 582 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (19 March 2022) is 768 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC 7231 (Obsoleted by RFC 9110)

  -- Obsolete informational reference (is this intentional?): RFC  793
     (Obsoleted by RFC 9293)


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                    Y. Shafranovich
3	Internet-Draft                                  Nightwatch Cybersecurity
4	Intended status: Informational                             19 March 2022
5	Expires: 20 September 2022

7	   Common Format and MIME Type for Comma-Separated Values (CSV) Files
8	                   draft-shafranovich-rfc4180-bis-02

10	Abstract

12	   This RFC documents the common format used for Comma-Separated Values
13	   (CSV) files and updates the associated MIME type "text/csv".

15	Status of This Memo

17	   This Internet-Draft is submitted in full conformance with the
18	   provisions of BCP 78 and BCP 79.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF).  Note that other groups may also distribute
22	   working documents as Internet-Drafts.  The list of current Internet-
23	   Drafts is at https://datatracker.ietf.org/drafts/current/.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   This Internet-Draft will expire on 20 September 2022.

32	Copyright Notice

34	   Copyright (c) 2022 IETF Trust and the persons identified as the
35	   document authors.  All rights reserved.

37	   This document is subject to BCP 78 and the IETF Trust's Legal
38	   Provisions Relating to IETF Documents (https://trustee.ietf.org/
39	   license-info) in effect on the date of publication of this document.
40	   Please review these documents carefully, as they describe your rights
41	   and restrictions with respect to this document.

43	Table of Contents

45	   1.  Introduction
46	     1.1.  Terminology
47	     1.2.  Motivation For and Status of This Document
48	   2.  Definition of the CSV Format
49	     2.1.  High level description
50	     2.2.  Default charset and line break values
51	     2.3.  ABNF Grammar
52	   3.  Common implementation concerns
53	     3.1.  Null values
54	     3.2.  Empty files
55	     3.3.  Empty lines
56	     3.4.  Fields spanning multiple lines
57	     3.5.  Unique header names
58	     3.6.  Whitespace outside of quoted fields
59	     3.7.  Other field separators
60	     3.8.  Escaping double quotes
61	     3.9.  BOM header
62	   4.  Update to MIME Type Registration of text/csv
63	     4.1.  IANA Considerations
64	   5.  Security Considerations
65	   6.  Acknowledgments
66	   7.  References
67	     7.1.  Normative References
68	     7.2.  Informative References
69	   Appendix A.  Major changes since RFC4180
70	   Appendix B.  Changes since the -00 draft
71	   Appendix C.  Changes since the -01 draft
72	   Appendix D.  Note to Readers
73	   Author's Address

75	1.  Introduction

77	   The comma separated values format (CSV) has been used as a common way
78	   to exchange data between disparate systems and applications for many
79	   years.  Surprisingly, while this format is very popular, it had never
80	   been formally documented and had no registered media type.
81	   This was addressed in 2005 via publication of [RFC4180] and the
82	   concurrent registration of the "text/csv" media type.

84	   Since the publication of [RFC4180], the CSV format has evolved and
85	   this specification seeks to reflect these changes as well as update
86	   the "text/csv" media type registration.

88	1.1.  Terminology

90	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
91	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
92	   "OPTIONAL" in this document are to be interpreted as described in BCP
93	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
94	   capitals, as shown here.

96	1.2.  Motivation For and Status of This Document

98	   The original motivation of [RFC4180] was to provide a reference in
99	   order to register the media type "text/csv".  It tried to document
100	   existing practices at the time based on the approaches used by most
101	   implementations.  This document continues to do the same, and updates
102	   the original document to reflect current practices for generating and
103	   consuming CSV files.

105	   Both [RFC4180] and this document are published as informational RFCs
106	   for the benefit of the Internet community and are not intended to be
107	   used as formal standards.  Implementers should consult [RFC1796] and
108	   [RFC2026] for crucial differences between IETF standards and
109	   informational RFCs.

111	2.  Definition of the CSV Format

113	   While there had been various specifications and implementations for
114	   the CSV format (for example, [CREATIVYST], [EDOCEO], [CSVW] and
115	   [ART]), prior to the publication of [RFC4180] there was no attempt to
116	   provide a common specification.  This section documents the format
117	   that seems to be followed by most implementations (incorporating
118	   changes since the publication of [RFC4180]).

120	2.1.  High level description

122	   1.  Each record is located on a separate line, ended by a line break
123	       (CR, LF or CRLF).  For example:

125	       aaa,bbb,cccCRLF
126	       zzz,yyy,xxxCRLF

128	   2.  The last record in the file MUST have an ending line break.  For
129	       example:

131	       aaa,bbb,cccCRLF
132	       zzz,yyy,xxxCRLF

134	   3.  The first record in the file MAY be an optional header with the
135	       same format as normal records.  This header will contain names
136	       corresponding to the fields in the file and SHOULD contain the
137	       same number of fields as the records in the rest of the file.
138	       For example:

140	       field_name_1,field_name_2,field_name_3CRLF
141	       aaa,bbb,cccCRLF
142	       zzz,yyy,xxxCRLF

144	   4.  Within each record, there MAY be one or more fields, separated by
145	       commas.  Each record SHOULD contain the same number of fields
146	       throughout the file.  Spaces are considered part of a field and
147	       SHOULD NOT be ignored.  The last field in the record MUST NOT be
148	       followed by a comma.  For example:

150	       aaa,bbb,cccCRLF

152	   5.  Each field MAY be enclosed in double quotes (however, some
153	       programs do not use double quotes at all).  If fields are not
154	       enclosed with double quotes, then double quotes MUST NOT appear
155	       inside the fields.  For example:

157	       "aaa","bbb","ccc"CRLF
158	       zzz,yyy,xxxCRLF

160	   6.  Fields containing line breaks (CR, LF or CRLF), double quotes, or
161	       commas MUST be enclosed in double quotes.  The same applies to
162	       the first field of a record that starts with a hash, to prevent
163	       the field from being parsed as a comment.  For example:

165	       "aaa","b CRLF
166	       bb","ccc"CRLF
167	       zzz,yyy,xxxCRLF
168	       "#aaa",#bbb,cccCRLF

170	   7.  A double-quote appearing inside a field MUST be escaped by
171	       preceding it with another double quote.  For example:

173	       "aaa","b""bb","ccc"CRLF

175	   8.  A hash sign MAY be used to mark lines that are meant to be
176	       commented lines.  A commented line can contain any whitespace or
177	       visible character until it is terminated by a line break (CR, LF
178	       or CRLF).  A comment line MAY appear in any line of the file
179	       (before or after an OPTIONAL header) but MUST NOT be mistaken
180	       for a subsequent line of a multi-line field.  Subsequent lines
181	       of multi-line fields can start with a hash sign and MUST NOT be
182	       interpreted as comments.  For example:

184	       #commentCRLF
185	       aaa,bbb,cccCRLF
186	       #comment 2CRLF
187	       "aaa","this is CRLF
188	       # not a comment","ccc"CRLF
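
   The rules above can be illustrated with a minimal parser sketch.  It
   is not part of this specification; the names parse_csv, _read_field
   and _skip_linebreak are hypothetical.  The sketch is more lenient
   than the rules in two ways: it accepts a final record without an
   ending line break, and it does not reject double quotes inside
   unquoted fields.

```python
def _skip_linebreak(text, i):
    """Return the index just past a CRLF, CR or LF at position i, if any."""
    if text[i:i + 2] == "\r\n":
        return i + 2
    if text[i:i + 1] in ("\r", "\n"):
        return i + 1
    return i


def _read_field(text, i):
    """Read one field starting at i; return (value, index after field)."""
    n = len(text)
    if i < n and text[i] == '"':
        # Quoted field (rules 5-7): may contain commas, line breaks and
        # doubled double quotes.
        i += 1
        chars = []
        while i < n:
            if text[i] == '"':
                if text[i + 1:i + 2] == '"':
                    chars.append('"')  # "" stands for one literal quote
                    i += 2
                    continue
                return "".join(chars), i + 1  # closing quote
            chars.append(text[i])
            i += 1
        raise ValueError("unterminated quoted field")
    # Unquoted field: runs until a comma, a line break or end of input.
    start = i
    while i < n and text[i] not in ",\r\n":
        i += 1
    return text[start:i], i


def parse_csv(text):
    """Parse CSV text into a list of records, skipping comment lines."""
    records = []
    i, n = 0, len(text)
    while i < n:
        # A hash at the start of a line marks a comment (rule 8).  Lines
        # inside a quoted field never reach this check.
        if text[i] == "#":
            while i < n and text[i] not in "\r\n":
                i += 1
            i = _skip_linebreak(text, i)
            continue
        record = []
        while True:
            field, i = _read_field(text, i)
            record.append(field)
            if i < n and text[i] == ",":
                i += 1  # field separator: read the next field
                continue
            break
        if i < n and text[i] not in "\r\n":
            raise ValueError("unexpected character after quoted field")
        i = _skip_linebreak(text, i)
        records.append(record)
    return records
```

   Applied to the example in rule 8, such a parser would return two
   records, with the second field of the last record containing the
   embedded line break and the hash sign.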

190	2.2.  Default charset and line break values

192	   Since the initial publication of [RFC4180], the default charset for
193	   "text/*" media types has been changed to UTF-8 (as per [RFC6657] and
194	   [RFC7111]).  This document reflects this change and the default
195	   charset for CSV files is now UTF-8.

197	   Although section 4.1.1. of [RFC2046] defines CRLF to denote line
198	   breaks, implementers MAY recognize a single CR or LF as a line break
199	   (similar to section 3.1.1.3 of [RFC7231]).  However, some
200	   implementations MAY use other values.
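
   As an illustration only, the following sketch decodes a byte string
   with UTF-8 as the default charset and splits it on CRLF, CR or LF.
   The names LINEBREAK and split_simple_lines are hypothetical.  The
   sketch assumes that no quoted field contains a line break; input
   with multi-line fields needs a real parser instead.

```python
import re

# CRLF is listed first so that it is consumed as one line break, not two.
LINEBREAK = re.compile(r"\r\n|\r|\n")


def split_simple_lines(data: bytes, charset: str = "utf-8") -> list:
    """Decode data (UTF-8 by default) and split it into lines."""
    text = data.decode(charset)
    lines = LINEBREAK.split(text)
    # A trailing line break leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```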

202	2.3.  ABNF Grammar

204	   The ABNF grammar (as per [RFC5234]) appears as follows:

206	file = *((comment / record) linebreak)

208	comment = hash *comment-data

210	record = first-field *(comma field)

212	linebreak = CR / LF / CRLF

214	first-field = (escaped / first-non-escaped)

216	field = (escaped / non-escaped)

218	escaped = DQUOTE *(data-with-hash / comma / CR / LF / 2DQUOTE) DQUOTE

220	first-non-escaped = [data *data-with-hash]

222	non-escaped = *data-with-hash

224	comma = %x2C

226	hash = %x23

228	comment-data = WSP / %x21-7E / UTF8-data
229	         ; characters without control characters

231	data = WSP / %x21 / %x24-2B / %x2D-7E / UTF8-data
232	         ; characters without control characters, comma, hash and DQUOTE

234	data-with-hash = data / hash

236	CR = %x0D ; as per section B.1 of [RFC5234]

238	DQUOTE = %x22 ; as per section B.1 of [RFC5234]

240	LF = %x0A ; as per section B.1 of [RFC5234]

242	CRLF = CR LF ; as per section B.1 of [RFC5234]

244	HTAB = %x09 ; as per section B.1 of [RFC5234]

246	SP = %x20 ; as per section B.1 of [RFC5234]

248	WSP = SP / HTAB ; as per section B.1 of [RFC5234]

250	UTF8-data = UTF8-2 / UTF8-3 / UTF8-4 ; as per section 4 of [RFC3629]

252	   Note that the authoritative definition of UTF-8 is in [UNICODE].
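
   From a producer's point of view, the "escaped", "non-escaped" and
   "first-non-escaped" productions determine when a field has to be
   quoted.  The following sketch shows one possible reading of the
   grammar; the names format_field and format_record are hypothetical,
   and the sketch does not check for control characters that the
   "data" production excludes.

```python
def format_field(value: str, first: bool = False) -> str:
    """Return value as a CSV field, quoting it only when required."""
    needs_quotes = (
        any(c in value for c in ',"\r\n')
        # Only the first field of a record can be mistaken for a comment.
        or (first and value.startswith("#"))
    )
    if needs_quotes:
        # Escape each double quote by doubling it (2DQUOTE in the grammar).
        return '"' + value.replace('"', '""') + '"'
    return value


def format_record(fields) -> str:
    """Join fields with commas and end the record with CRLF."""
    parts = [format_field(f, first=(i == 0)) for i, f in enumerate(fields)]
    return ",".join(parts) + "\r\n"
```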

254	3.  Common implementation concerns

256	   This section describes some common concerns that may arise when
257	   producing or parsing CSV files.  All of these remain out of scope for
258	   this document and are included for awareness.  Implementers may also
259	   use other means to handle these use cases such as [CSVW].

261	3.1.  Null values

263	   Some implementations (such as databases) treat empty fields and null
264	   values differently.  For these implementations, there is a need to
265	   define a special value representing a null.

267	   Example of a CSV file with nulls (if "NULL" is used to mark nulls):

269	   field_name_1,field_name_2,field_name_3CRLF
270	   aaa,bbb,cccCRLF
271	   zzz,NULL,xxxCRLF
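
   A consumer could map such a marker back to a native null value.  The
   sketch below uses Python's standard csv module; the name
   read_with_nulls and the choice of "NULL" as the marker are
   assumptions for illustration.  Note that this approach cannot tell
   a quoted "NULL" from an unquoted one.

```python
import csv
import io

NULL_MARKER = "NULL"  # assumption: agreed between producer and consumer


def read_with_nulls(text, null_marker=NULL_MARKER):
    """Parse CSV text, replacing fields equal to null_marker with None."""
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[None if f == null_marker else f for f in row] for row in reader]
```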

273	3.2.  Empty files

275	   Implementers should be aware that, in accordance with this
276	   specification, a file does not need to contain any comments or
277	   records (an empty file with zero bytes).

279	3.3.  Empty lines

281	   This specification recommends but doesn't require having the same
282	   number of fields in every line.  This allows CSV files to have empty
283	   lines without any fields at all.  Some implementations can be
284	   configured to skip empty lines instead of parsing them.

286	   Example of a CSV file with empty lines:

288	   field_name_1,field_name_2,field_name_3CRLF
289	   aaa,bbb,cccCRLF
290	   CRLF
291	   zzz,yyy,xxxCRLF

293	   However, if the records are only made up of one field it is not
294	   possible to differentiate between an empty line, and an empty and
295	   unquoted field.  This differentiation might play an important role in
296	   some implementations such as database exports/imports.

298	   Example of a CSV file with empty lines and only one field per record:

300	   aaaCRLF
301	   CRLF
302	   bbbCRLF
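
   As one example of configurable behavior, Python's standard csv
   module reports an empty line as a record with no fields, which a
   consumer can keep or drop.  The name read_records and its
   skip_empty_lines parameter are hypothetical.

```python
import csv
import io


def read_records(text, skip_empty_lines=True):
    """Parse CSV text; optionally drop records that came from empty lines."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if skip_empty_lines:
        # csv.reader yields an empty list for a line without any fields.
        rows = [row for row in rows if row]
    return rows
```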

304	3.4.  Fields spanning multiple lines

306	   When quoted fields are used, it is possible for a field to span
307	   multiple lines, because line breaks may appear within such a field.

309	3.5.  Unique header names

311	   Implementers should be aware that some applications may treat header
312	   values as unique (either case-sensitive or case-insensitive).
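
   A producer that wants to stay compatible with such applications
   could check a header for repeated names before writing it.  The
   following is a minimal sketch; the name duplicate_headers is
   hypothetical.

```python
from collections import Counter


def duplicate_headers(header, case_sensitive=False):
    """Return the header names that occur more than once, sorted."""
    keys = list(header) if case_sensitive else [h.casefold() for h in header]
    counts = Counter(keys)
    return sorted(name for name, count in counts.items() if count > 1)
```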

314	3.6.  Whitespace outside of quoted fields

316	   When quoted fields are used, this document does not allow whitespace
317	   between double quotes and commas.  Implementers should be aware that
318	   some applications may be more lenient and allow whitespace outside
319	   the double quotes.
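
   Python's standard csv module is one example of a parser that offers
   both behaviors.  In the sketch below, the hypothetical helper
   parse_line uses the module's "skipinitialspace" option to accept
   whitespace after a comma; without it, the whitespace and the double
   quotes that follow become literal field data.

```python
import csv


def parse_line(line, lenient=False):
    """Parse one CSV line; lenient=True ignores spaces after commas."""
    return next(csv.reader([line], skipinitialspace=lenient))


line = '"aaa", "bbb"'
as_written = parse_line(line)            # space and quotes kept as data
lenient = parse_line(line, lenient=True)  # second field parsed as quoted
```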

321	3.7.  Other field separators

323	   This document defines a comma as a field separator but implementers
324	   should be aware that some applications may use different values,
325	   especially with non-English languages.  Those are outside the scope
326	   of this document and implementers should consult other efforts such
327	   as [CSVW].
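
   Many parsers let the caller choose the separator.  As an
   illustration, the sketch below reads semicolon-separated data with
   Python's standard csv module; the name read_delimited is
   hypothetical, and the semicolon is only one of several separators
   found in practice.

```python
import csv
import io


def read_delimited(text, delimiter=";"):
    """Parse text whose fields are separated by the given delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)
```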

329	3.8.  Escaping double quotes

331	   This document prescribes that a double-quote appearing inside a field
332	   must be escaped by preceding it with another double quote.
333	   Implementers should be aware that some applications may choose to use
334	   a different escaping mechanism.
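
   One such alternative is a backslash placed before the double quote.
   The sketch below reads that variant with Python's standard csv
   module by disabling quote doubling; the name parse_backslash_escaped
   is hypothetical, and the backslash convention is an assumption
   chosen for illustration.

```python
import csv


def parse_backslash_escaped(line):
    """Parse one line in which a quote inside a field is written as \\"."""
    return next(csv.reader([line], doublequote=False, escapechar="\\"))


# The text being parsed is: "aaa","b\"bb","ccc"
row = parse_backslash_escaped('"aaa","b\\"bb","ccc"')
```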

336	3.9.  BOM header

338	   Applications that create text files with a Unicode character encoding
339	   might write a BOM (byte order mark) header in order to support
340	   multiple Unicode encodings (like UTF-16 and UTF-32).  Some
341	   applications might be able to read and properly interpret such a
342	   header, while others could break.  Implementers should review section
343	   6 of [RFC3629] and section 23.8 of [UNICODE].
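
   For UTF-8 input, a consumer can tolerate an optional BOM by removing
   it while decoding.  The sketch below relies on Python's "utf-8-sig"
   codec; the name decode_utf8_with_optional_bom is hypothetical, and
   other Unicode encodings are not handled.

```python
import codecs


def decode_utf8_with_optional_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + b"aaa,bbb\r\n"
without_bom = b"aaa,bbb\r\n"
```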

345	4.  Update to MIME Type Registration of text/csv

347	   The media type registration of "text/csv" should be updated as per
348	   specific fields below:

350	   Encoding considerations:

352	      CSV MIME entities can consist of binary data as per section 4.8 of
353	      [RFC6838].  Although section 4.1.1. of [RFC2046] defines CRLF to
354	      denote line breaks, implementers MAY recognize a single CR or LF
355	      as a line break (similar to section 3.1.1.3 of [RFC7231]).
356	      However, some implementations may use other values.

358	   Published specification:

360	      While numerous private specifications exist for various programs
361	      and systems, there is no single "master" specification for this
362	      format.  An attempt at a common definition can be found in
363	      [RFC4180] and this document.  Implementers should note that both
364	      documents are informational in nature and are not standards.

366	   Optional parameters: charset

368	      The "charset" parameter specifies the charset employed by the CSV
369	      content.  In accordance with [RFC6657], the charset parameter
370	      SHOULD be used, and if it is not present, UTF-8 SHOULD be assumed
371	      as the default (this implies that US-ASCII CSV will work, even
372	      when not specifying the "charset" parameter).  Any charset defined
373	      by IANA for the "text" tree may be used in conjunction with the
374	      "charset" parameter.
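
      A consumer might apply this default as sketched below.  The
      sketch uses the header parser from Python's standard email
      package; the name csv_charset is hypothetical.

```python
from email.message import Message


def csv_charset(content_type: str) -> str:
    """Return the charset named in a Content-Type value, or UTF-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is
    # present, in which case UTF-8 is assumed as the default.
    return msg.get_content_charset() or "utf-8"
```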

376	   Security considerations:

378	      Text/csv consists of nothing but passive text data that should not
379	      pose any direct risks.  However, it is possible that malicious
380	      data may be included in order to exploit buffer overruns or other
381	      bugs in the program processing the text/csv data.

383	      Implementers and users should also be aware that some software
384	      applications may interpret certain characters in the beginning of
385	      CSV fields as referring to code or formulas, thus resulting in
386	      malicious code execution.  This is known as "CSV injection" and
387	      users consuming CSV files should filter out such characters.
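
      As an illustration, a consumer could flag suspicious fields
      before passing data to a spreadsheet application.  The set of
      characters below is an assumption based on commonly cited
      guidance and is not defined by this document; the names
      FORMULA_PREFIXES, looks_like_formula and find_suspicious_fields
      are hypothetical.  Legitimate values such as negative numbers
      are flagged too, so the caller has to decide how to treat each
      match.

```python
# Assumption: characters that some spreadsheet applications treat as the
# start of a formula.  This list is not taken from the specification.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def looks_like_formula(field: str) -> bool:
    """Return True if the field starts with a formula-like character."""
    return field.startswith(FORMULA_PREFIXES)


def find_suspicious_fields(records):
    """Return (row index, column index, value) for each flagged field."""
    return [
        (r, c, field)
        for r, record in enumerate(records)
        for c, field in enumerate(record)
        if looks_like_formula(field)
    ]
```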

389	      The text/csv format provides no confidentiality or integrity
390	      protection, so if such protections are needed they must be
391	      supplied externally.

393	      The fact that software implementing fragment identifiers for CSV
394	      and software not implementing them differs in behavior, and the
395	      fact that different software may show documents or fragments to
396	      users in different ways, can lead to misunderstandings on the part
397	      of users.  Such misunderstandings might be exploited in a way
398	      similar to spoofing or phishing.

400	      Implementers and users of fragment identifiers for CSV text should
401	      also be aware of the security considerations in RFC 3986 [RFC3986]
402	      and RFC 3987 [RFC3987].

404	   Interoperability considerations:

406	      Due to lack of a single specification, there are considerable
407	      differences among implementations.  Implementers should "be
408	      conservative in what you do, be liberal in what you accept from
409	      others" ([RFC0793]) when processing CSV files.  An attempt at a
410	      common definition can be found in Section 2.

412	4.1.  IANA Considerations

414	   IANA is directed to update the MIME type registration for "text/csv"
415	   as per instructions provided in Section 4 of this document and
416	   include a reference to this document within the registration.

418	5.  Security Considerations

420	   All security considerations discussed in Section 4 still apply.

422	6.  Acknowledgments

424	   In addition to everyone thanked previously in [RFC4180], the author
425	   would like to acknowledge the contributions of the following
426	   people to this document: Alperen Belgic, Abed BenBrahim, Damon Koach,
427	   Barry Leiba, Oliver Siegmar, Marco Diniz Sousa and Greg Skinner.

429	   A special thank you to L.T.S.

431	7.  References

433	7.1.  Normative References

435	   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
436	              Extensions (MIME) Part Two: Media Types", RFC 2046,
437	              DOI 10.17487/RFC2046, November 1996,
438	              <https://www.rfc-editor.org/info/rfc2046>.

440	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
441	              Requirement Levels", BCP 14, RFC 2119,
442	              DOI 10.17487/RFC2119, March 1997,
443	              <https://www.rfc-editor.org/info/rfc2119>.

445	   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for Comma-
446	              Separated Values (CSV) Files", RFC 4180,
447	              DOI 10.17487/RFC4180, October 2005,
448	              <https://www.rfc-editor.org/info/rfc4180>.

450	   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
451	              Specifications: ABNF", STD 68, RFC 5234,
452	              DOI 10.17487/RFC5234, January 2008,
453	              <https://www.rfc-editor.org/info/rfc5234>.

455	   [RFC6657]  Melnikov, A. and J. Reschke, "Update to MIME regarding
456	              "charset" Parameter Handling in Textual Media Types",
457	              RFC 6657, DOI 10.17487/RFC6657, July 2012,
458	              <https://www.rfc-editor.org/info/rfc6657>.

460	   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
461	              Specifications and Registration Procedures", BCP 13,
462	              RFC 6838, DOI 10.17487/RFC6838, January 2013,
463	              <https://www.rfc-editor.org/info/rfc6838>.

465	   [RFC7111]  Hausenblas, M., Wilde, E., and J. Tennison, "URI Fragment
466	              Identifiers for the text/csv Media Type", RFC 7111,
467	              DOI 10.17487/RFC7111, January 2014,
468	              <https://www.rfc-editor.org/info/rfc7111>.

470	   [RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
471	              Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
472	              DOI 10.17487/RFC7231, June 2014,
473	              <https://www.rfc-editor.org/info/rfc7231>.

475	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
476	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
477	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

479	7.2.  Informative References

481	   [ART]      Raymond, E., "The Art of Unix Programming, Chapter 5",
482	              September 2003,
483	              <http://www.catb.org/~esr/writings/taoup/html/
484	              ch05s02.html>.

486	   [CREATIVYST]
487	              Repici, J., "HOW-TO: The Comma Separated Value (CSV) File
488	              Format", 2010,
489	              <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>.

491	   [CSVW]     W3C, "CSV on the Web Working Group", 2016,
492	              <https://www.w3.org/2013/csvw/wiki/Main_Page>.

494	   [EDOCEO]   Edoceo, Inc., "Comma Separated Values (CSV) Standard File
495	              Format", 2020, <https://edoceo.com/dev/csv-file-format>.

497	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
498	              RFC 793, DOI 10.17487/RFC0793, September 1981,
499	              <https://www.rfc-editor.org/info/rfc793>.

501	   [RFC1796]  Huitema, C., Postel, J., and S. Crocker, "Not All RFCs are
502	              Standards", RFC 1796, DOI 10.17487/RFC1796, April 1995,
503	              <https://www.rfc-editor.org/info/rfc1796>.

505	   [RFC2026]  Bradner, S., "The Internet Standards Process -- Revision
506	              3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996,
507	              <https://www.rfc-editor.org/info/rfc2026>.

509	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
510	              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
511	              2003, <https://www.rfc-editor.org/info/rfc3629>.

513	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
514	              Resource Identifier (URI): Generic Syntax", STD 66,
515	              RFC 3986, DOI 10.17487/RFC3986, January 2005,
516	              <https://www.rfc-editor.org/info/rfc3986>.

518	   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
519	              Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
520	              January 2005, <https://www.rfc-editor.org/info/rfc3987>.

522	   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
523	              13.0.0", March 2020,
524	              <https://www.unicode.org/versions/Unicode13.0.0/>.

526	Appendix A.  Major changes since [RFC4180]

528	   *  Added a section clarifying motivation for this document and
529	      standards status

531	   *  Changing default encoding to UTF-8 and adding Unicode to the ABNF
532	      grammar

534	   *  Allowing CR, LF and CRLF for line breaks

536	   *  Allowing HTAB in text data

538	   *  Mandating a line break at the end of the last line in the file

540	   *  Making records and headers optional, thus allowing for an empty
541	      file

543	   *  Adding definition of commented lines

545	   *  Adding a section on common implementation concerns

547	   *  Removed "header" parameter for the MIME type since it is not used

549	Appendix B.  Changes since the -00 draft

551	   *  Added CSV injection to security considerations (#30)

553	   *  Added a reference to RFC 7111 (#27)

555	Appendix C.  Changes since the -01 draft

557	   *  No changes yet, refreshed to keep draft alive

559	Appendix D.  Note to Readers

561	      *Note to the RFC Editor:* Please remove this section prior to
562	      publication.

564	   Development of this draft takes place on GitHub at:
565	   https://github.com/nightwatchcybersecurity/rfc4180-bis

567	   Comments can also be sent to the ART mailing list at:
568	   https://www.ietf.org/mailman/listinfo/art

570	   The full list of changes can be viewed via the IETF document tracker:
571	   https://tools.ietf.org/html/draft-shafranovich-rfc4180-bis

573	Author's Address

575	   Yakov Shafranovich
576	   Nightwatch Cybersecurity

578	   Email: yakov+ietf@nightwatchcybersecurity.com