idnits 2.17.1

draft-shafranovich-rfc4180-bis-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 532 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (26 March 2021) is 1124 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC 7231 (Obsoleted by RFC 9110)

  -- Obsolete informational reference (is this intentional?): RFC 793
     (Obsoleted by RFC 9293)

     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network Working Group                                    Y. Shafranovich
Internet-Draft                                  Nightwatch Cybersecurity
Intended status: Informational                             26 March 2021
Expires: 27 September 2021

   Common Format and MIME Type for Comma-Separated Values (CSV) Files
                   draft-shafranovich-rfc4180-bis-00

Abstract

   This RFC documents the common format used for Comma-Separated Values
   (CSV) files and updates the associated MIME type "text/csv".

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 27 September 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Motivation For and Status of This Document
   2.  Definition of the CSV Format
     2.1.  High level description
     2.2.  Default charset and line break values
     2.3.  ABNF Grammar
   3.  Common implementation concerns
     3.1.  Null values
     3.2.  Empty files
     3.3.  Empty lines
     3.4.  Fields spanning multiple lines
     3.5.  Unique header names
     3.6.  Whitespace outside of quoted fields
     3.7.  Other field separators
     3.8.  Escaping double quotes
     3.9.  BOM header
   4.  Update to MIME Type Registration of text/csv
     4.1.  IANA Considerations
   5.  Security Considerations
   6.  Acknowledgments
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Major changes since RFC4180
   Appendix B.  Note to Readers
   Author's Address

1.  Introduction

   The comma separated values format (CSV) has been used as a common way
   to exchange data between disparate systems and applications for many
   years.  Surprisingly, although this format is very popular, it was
   not formally documented and had no registered media type until 2005,
   when [RFC4180] was published and the "text/csv" media type was
   concurrently registered.

   Since the publication of [RFC4180], the CSV format has evolved.  This
   specification seeks to reflect these changes and to update the
   "text/csv" media type registration.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  Motivation For and Status of This Document

   The original motivation of [RFC4180] was to provide a reference in
   order to register the media type "text/csv".  It tried to document
   existing practices at the time based on the approaches used by most
   implementations.  This document continues to do the same, and updates
   the original document to reflect current practices for generating and
   consuming CSV files.

   Both [RFC4180] and this document are published as informational RFCs
   for the benefit of the Internet community and are not intended to be
   used as formal standards.  Implementers should consult [RFC1796] and
   [RFC2026] for crucial differences between IETF standards and
   informational RFCs.

2.  Definition of the CSV Format

   While there had been various specifications and implementations for
   the CSV format (for example, [CREATIVYST], [EDOCEO], [CSVW] and
   [ART]), prior to the publication of [RFC4180] there had been no
   attempt to provide a common specification.  This section documents
   the format that seems to be followed by most implementations
   (incorporating changes since the publication of [RFC4180]).

2.1.  High level description

   1.  Each record is located on a separate line, ended by a line break
       (CR, LF or CRLF).  For example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   2.  The last record in the file MUST have an ending line break.  For
       example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   3.  The first record in the file MAY be a header with the same format
       as normal records.  This header will contain names corresponding
       to the fields in the file and SHOULD contain the same number of
       fields as the records in the rest of the file.  For example:

       field_name_1,field_name_2,field_name_3CRLF
       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   4.  Within each record, there MAY be one or more fields, separated by
       commas.  Each record SHOULD contain the same number of fields
       throughout the file.  Spaces are considered part of a field and
       SHOULD NOT be ignored.  The last field in the record MUST NOT be
       followed by a comma.  For example:

       aaa,bbb,cccCRLF

   5.  Each field MAY be enclosed in double quotes (however, some
       programs do not use double quotes at all).  If fields are not
       enclosed in double quotes, then double quotes MUST NOT appear
       inside the fields.  For example:

       "aaa","bbb","ccc"CRLF
       zzz,yyy,xxxCRLF

   6.  Fields containing line breaks (CR, LF or CRLF), double quotes, or
       commas MUST be enclosed in double quotes.  The same applies to
       the first field of a record when it starts with a hash sign, to
       prevent the record from being parsed as a comment.  For example:

       "aaa","b CRLF
       bb","ccc"CRLF
       zzz,yyy,xxxCRLF
       "#aaa",#bbb,cccCRLF

   7.  A double quote appearing inside a field MUST be escaped by
       preceding it with another double quote.  For example:

       "aaa","b""bb","ccc"CRLF

   8.  A hash sign at the start of a line MAY be used to mark the line
       as a comment line.  A comment line can contain any whitespace or
       visible character until it is terminated by a line break (CR, LF
       or CRLF).  A comment line MAY appear on any line of the file
       (before or after an OPTIONAL header) but MUST NOT be mistaken for
       a subsequent line of a multi-line field.  Subsequent lines of
       multi-line fields can start with a hash sign and MUST NOT be
       interpreted as comments.  For example:

       #commentCRLF
       aaa,bbb,cccCRLF
       #comment 2CRLF
       "aaa","this is CRLF
       # not a comment","ccc"CRLF

2.2.  Default charset and line break values

   Since the initial publication of [RFC4180], the default charset for
   "text/*" media types has been changed to UTF-8 (as per [RFC6657]).
   This document reflects this change and the default charset for CSV
   files is now UTF-8.

   Although Section 4.1.1 of [RFC2046] defines CRLF to denote line
   breaks, implementers MAY recognize a single CR or LF as a line break
   (similar to Section 3.1.1.3 of [RFC7231]).  However, some
   implementations MAY use other values.

2.3.  ABNF Grammar

   The ABNF grammar (as per [RFC5234]) appears as follows:

   file = *((comment / record) linebreak)

   comment = hash *comment-data

   record = first-field *(comma field)

   linebreak = CR / LF / CRLF

   first-field = (escaped / first-non-escaped)

   field = (escaped / non-escaped)

   escaped = DQUOTE *(data-with-hash / comma / CR / LF / 2DQUOTE) DQUOTE

   first-non-escaped = [data *data-with-hash]

   non-escaped = *data-with-hash

   comma = %x2C

   hash = %x23

   comment-data = WSP / %x21-7E / UTF8-data
      ; characters without control characters

   data = WSP / %x21 / %x24-2B / %x2D-7E / UTF8-data
      ; characters without control characters, comma, hash and DQUOTE

   data-with-hash = data / hash

   CR = %x0D ; as per section B.1 of [RFC5234]

   DQUOTE = %x22 ; as per section B.1 of [RFC5234]

   LF = %x0A ; as per section B.1 of [RFC5234]

   CRLF = CR LF ; as per section B.1 of [RFC5234]

   HTAB = %x09 ; as per section B.1 of [RFC5234]

   SP = %x20 ; as per section B.1 of [RFC5234]

   WSP = SP / HTAB ; as per section B.1 of [RFC5234]

   UTF8-data = UTF8-2 / UTF8-3 / UTF8-4 ; as per section 4 of [RFC3629]

   Note that the authoritative definition of UTF-8 is in [UNICODE].

3.  Common implementation concerns

   This section describes some common concerns that may arise when
   producing or parsing CSV files.  All of these remain out of scope for
   this document and are included for awareness.  Implementers may also
   use other means to handle these use cases, such as [CSVW].

3.1.  Null values

   Some implementations (such as databases) treat empty fields and null
   values differently.  For these implementations, there is a need to
   define a special value representing a null.

   Example of a CSV file with nulls (if "NULL" is used to mark nulls):

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   zzz,NULL,xxxCRLF

3.2.  Empty files

   Implementers should be aware that, in accordance with this
   specification, a file does not need to contain any comments or
   records (an empty file with zero bytes).

3.3.  Empty lines

   This specification recommends but doesn't require having the same
   number of fields in every line.  This allows CSV files to have empty
   lines without any fields at all.  Some implementations can be
   configured to skip empty lines instead of parsing them.

   Example of a CSV file with empty lines:

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   CRLF
   zzz,yyy,xxxCRLF

   However, if the records are made up of only one field, it is not
   possible to differentiate between an empty line and an empty,
   unquoted field.  This differentiation might play an important role in
   some implementations such as database exports/imports.

   Example of a CSV file with empty lines and only one field per record:

   aaaCRLF
   CRLF
   bbbCRLF

3.4.  Fields spanning multiple lines

   When quoted fields are used, it is possible for a field to span
   multiple lines, since line breaks may appear within such a field.

3.5.  Unique header names

   Implementers should be aware that some applications may treat header
   values as unique (either case-sensitive or case-insensitive).

3.6.  Whitespace outside of quoted fields

   When quoted fields are used, this document does not allow whitespace
   between double quotes and commas.  Implementers should be aware that
   some applications may be more lenient and allow whitespace outside
   the double quotes.

3.7.  Other field separators

   This document defines a comma as a field separator but implementers
   should be aware that some applications may use different values,
   especially with non-English languages.  Those are outside the scope
   of this document and implementers should consult other efforts such
   as [CSVW].

3.8.  Escaping double quotes

   This document prescribes that a double quote appearing inside a field
   must be escaped by preceding it with another double quote.
   Implementers should be aware that some applications may choose to use
   a different escaping mechanism.

3.9.  BOM header

   Applications that create text files with Unicode character encoding
   might write a BOM (byte order mark) header in order to support
   multiple Unicode encodings (like UTF-16 and UTF-32).  Some
   applications might be able to read and properly interpret such a
   header; others could break.  Implementers should review Section 6 of
   [RFC3629] and Section 23.8 of [UNICODE].

4.  Update to MIME Type Registration of text/csv

   The media type registration of "text/csv" should be updated as per
   the specific fields below:

   Encoding considerations:

      CSV MIME entities can consist of binary data as per Section 4.8 of
      [RFC6838].  Although Section 4.1.1 of [RFC2046] defines CRLF to
      denote line breaks, implementers MAY recognize a single CR or LF
      as a line break (similar to Section 3.1.1.3 of [RFC7231]).
      However, some implementations may use other values.

   Published specification:

      While numerous private specifications exist for various programs
      and systems, there is no single "master" specification for this
      format.  An attempt at a common definition can be found in
      [RFC4180] and this document.  Implementers should note that both
      documents are informational in nature and are not standards.

   Optional parameters: charset

      The "charset" parameter specifies the charset employed by the CSV
      content.  In accordance with [RFC6657], the charset parameter
      SHOULD be used, and if it is not present, UTF-8 SHOULD be assumed
      as the default (this implies that US-ASCII CSV will work, even
      when not specifying the "charset" parameter).  Any charset defined
      by IANA for the "text" tree may be used in conjunction with the
      "charset" parameter.

   Interoperability considerations:

      Due to the lack of a single specification, there are considerable
      differences among implementations.  Implementers should "be
      conservative in what you do, be liberal in what you accept from
      others" ([RFC0793]) when processing CSV files.  An attempt at a
      common definition can be found in Section 2.

4.1.  IANA Considerations

   IANA is directed to update the MIME type registration for "text/csv"
   as per the instructions provided in Section 4 of this document and
   include a reference to this document within the registration.

5.  Security Considerations

   All security considerations as discussed in [RFC4180] still apply.

6.  Acknowledgments

   In addition to everyone thanked previously in [RFC4180], the author
   would like to acknowledge the contributions of the following people
   to this document: Alperen Belgic, Abed BenBrahim, Benjamin Kaduk,
   Damon Koach, Barry Leiba, Oliver Siegmar, Marco Diniz Sousa and Greg
   Skinner.

   A special thank you to L.T.S.

7.  References

7.1.  Normative References

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for
              Comma-Separated Values (CSV) Files", RFC 4180,
              DOI 10.17487/RFC4180, October 2005.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008.

   [RFC6657]  Melnikov, A. and J. Reschke, "Update to MIME regarding
              "charset" Parameter Handling in Textual Media Types",
              RFC 6657, DOI 10.17487/RFC6657, July 2012.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013.

   [RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
              DOI 10.17487/RFC7231, June 2014.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

7.2.  Informative References

   [ART]      Raymond, E., "The Art of Unix Programming, Chapter 5",
              September 2003.

   [CREATIVYST]
              Repici, J., "HOW-TO: The Comma Separated Value (CSV) File
              Format", 2010.

   [CSVW]     W3C, "CSV on the Web Working Group", 2016.

   [EDOCEO]   Edoceo, Inc., "Comma Separated Values (CSV) Standard File
              Format", 2020.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1796]  Huitema, C., Postel, J., and S. Crocker, "Not All RFCs are
              Standards", RFC 1796, DOI 10.17487/RFC1796, April 1995.

   [RFC2026]  Bradner, S., "The Internet Standards Process -- Revision
              3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              13.0.0", March 2020.

Appendix A.  Major changes since [RFC4180]

   *  Adding a section clarifying motivation for this document and
      standards status

   *  Changing default encoding to UTF-8 and adding Unicode to the ABNF
      grammar

   *  Allowing CR, LF and CRLF for line breaks

   *  Allowing HTAB in text data

   *  Mandating a line break at the end of the last line in the file

   *  Making records and headers optional, thus allowing for an empty
      file

   *  Adding definition of commented lines

   *  Adding a section on common implementation concerns

   *  Removing the "header" parameter for the MIME type

Appendix B.  Note to Readers

   *Note to the RFC Editor:* Please remove this section prior to
   publication.

   Development of this draft takes place on GitHub at:
   https://github.com/nightwatchcyber/rfc4180-bis

   Comments can also be sent to the ART mailing list at:
   https://www.ietf.org/mailman/listinfo/art

   The full list of changes can be viewed via the IETF document tracker:
   https://tools.ietf.org/html/draft-shafranovich-rfc4180-bis

Author's Address

   Yakov Shafranovich
   Nightwatch Cybersecurity

   Email: yakov+ietf@nightwatchcybersecurity.com
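
   The following non-normative sketch illustrates how a reader might
   apply the rules of Section 2.1 and the grammar of Section 2.3.  It
   is not part of the draft and is only one possible approach.  The
   function names (parse_csv, _parse_record, _skip_linebreak) are
   hypothetical.  The sketch is deliberately lenient: it accepts a final
   record without a line break and does not reject double quotes inside
   unquoted fields.

```python
def _skip_linebreak(text, i):
    """Advance past one line break (CRLF, CR or LF) if present."""
    if text.startswith("\r\n", i):
        return i + 2
    if i < len(text) and text[i] in "\r\n":
        return i + 1
    return i


def _parse_record(text, i):
    """Parse one record starting at index i; return (fields, next_index)."""
    fields = []
    n = len(text)
    while True:
        if i < n and text[i] == '"':
            # Quoted ("escaped") field: may contain commas, line breaks,
            # hash signs, and doubled double quotes.
            i += 1
            buf = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if text[i] == '"':
                    if i + 1 < n and text[i + 1] == '"':
                        buf.append('"')  # "" is an escaped double quote
                        i += 2
                        continue
                    i += 1  # closing double quote
                    break
                buf.append(text[i])
                i += 1
            fields.append("".join(buf))
            if i < n and text[i] not in ",\r\n":
                raise ValueError("unexpected data after closing quote")
        else:
            # Unquoted field: runs up to the next comma or line break.
            start = i
            while i < n and text[i] not in ",\r\n":
                i += 1
            fields.append(text[start:i])
        if i < n and text[i] == ",":
            i += 1
            continue
        return fields, _skip_linebreak(text, i)


def parse_csv(text):
    """Return a list of records (lists of strings), skipping comments."""
    records = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            # A hash at the start of a line marks a comment line.
            while i < n and text[i] not in "\r\n":
                i += 1
            i = _skip_linebreak(text, i)
            continue
        record, i = _parse_record(text, i)
        records.append(record)
    return records


sample = '#comment\r\naaa,"b\r\n# not a comment","c""c"\r\nzzz,#yyy,xxx\n'
rows = parse_csv(sample)
```

   In this sketch an empty line yields a record with a single empty
   field, which mirrors the ambiguity described in Section 3.3.  Because
   the comment check happens only at the start of a record, a hash sign
   at the start of a continuation line inside a quoted field is kept as
   data.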