Network Working Group                                    Y. Shafranovich
Internet-Draft                                  Nightwatch Cybersecurity
Intended status: Informational                             19 March 2022
Expires: 20 September 2022


   Common Format and MIME Type for Comma-Separated Values (CSV) Files
                   draft-shafranovich-rfc4180-bis-02

Abstract

   This RFC documents the common format used for Comma-Separated Values
   (CSV) files and updates the associated MIME type "text/csv".

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 20 September 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Motivation For and Status of This Document
   2.  Definition of the CSV Format
     2.1.  High level description
     2.2.  Default charset and line break values
     2.3.  ABNF Grammar
   3.  Common implementation concerns
     3.1.  Null values
     3.2.  Empty files
     3.3.  Empty lines
     3.4.  Fields spanning multiple lines
     3.5.  Unique header names
     3.6.  Whitespace outside of quoted fields
     3.7.  Other field separators
     3.8.  Escaping double quotes
     3.9.  BOM header
   4.  Update to MIME Type Registration of text/csv
     4.1.  IANA Considerations
   5.  Security Considerations
   6.  Acknowledgments
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Major changes since RFC4180
   Appendix B.  Changes since the -00 draft
   Appendix C.  Changes since the -01 draft
   Appendix D.  Note to Readers
   Author's Address

1.  Introduction

   The comma separated values format (CSV) has been used as a common way
   to exchange data between disparate systems and applications for many
   years. Surprisingly, while this format was very popular, it had never
   been formally documented and did not have a registered media type.
   This was addressed in 2005 via publication of [RFC4180] and the
   concurrent registration of the "text/csv" media type.

   Since the publication of [RFC4180], the CSV format has evolved and
   this specification seeks to reflect these changes as well as update
   the "text/csv" media type registration.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  Motivation For and Status of This Document

   The original motivation of [RFC4180] was to provide a reference in
   order to register the media type "text/csv". It tried to document
   existing practices at the time based on the approaches used by most
   implementations. This document continues to do the same, and updates
   the original document to reflect current practices for generating and
   consuming CSV files.

   Both [RFC4180] and this document are published as informational RFCs
   for the benefit of the Internet community and are not intended to be
   used as formal standards. Implementers should consult [RFC1796] and
   [RFC2026] for crucial differences between IETF standards and
   informational RFCs.

2.  Definition of the CSV Format

   While there had been various specifications and implementations for
   the CSV format (for example, [CREATIVYST], [EDOCEO], [CSVW] and
   [ART]), prior to publication of [RFC4180] there was no attempt to
   provide a common specification. This section documents the format
   that seems to be followed by most implementations (incorporating
   changes since the publication of [RFC4180]).

2.1.  High level description

   1.  Each record is located on a separate line, ended by a line break
       (CR, LF or CRLF). For example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   2.  The last record in the file MUST have an ending line break. For
       example:

       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   3.  The first record in the file MAY be an optional header with the
       same format as normal records. This header will contain names
       corresponding to the fields in the file and SHOULD contain the
       same number of fields as the records in the rest of the file.
       For example:

       field_name_1,field_name_2,field_name_3CRLF
       aaa,bbb,cccCRLF
       zzz,yyy,xxxCRLF

   4.  Within each record, there MAY be one or more fields, separated by
       commas. Each record SHOULD contain the same number of fields
       throughout the file. Spaces are considered part of a field and
       SHOULD NOT be ignored. The last field in the record MUST NOT be
       followed by a comma. For example:

       aaa,bbb,cccCRLF

   5.  Each field MAY be enclosed in double quotes (however, some
       programs do not use double quotes at all). If fields are not
       enclosed with double quotes, then double quotes MUST NOT appear
       inside the fields.
       For example:

       "aaa","bbb","ccc"CRLF
       zzz,yyy,xxxCRLF

   6.  Fields containing line breaks (CR, LF or CRLF), double quotes, or
       commas MUST be enclosed in double quotes. The same applies to
       the first field of a record that starts with a hash, to prevent
       the field from being parsed as a comment. For example:

       "aaa","b CRLF
       bb","ccc"CRLF
       zzz,yyy,xxxCRLF
       "#aaa",#bbb,cccCRLF

   7.  A double quote appearing inside a field MUST be escaped by
       preceding it with another double quote. For example:

       "aaa","b""bb","ccc"CRLF

   8.  A hash sign MAY be used to mark lines that are meant to be
       commented lines. A commented line can contain any whitespace or
       visible character until it is terminated by a line break (CR, LF
       or CRLF). A comment line MAY appear in any line of the file
       (before or after an OPTIONAL header) but MUST NOT be mistaken
       for a subsequent line of a multi-line field. Subsequent lines
       of multi-line fields can start with a hash sign and MUST NOT be
       interpreted as comments. For example:

       #commentCRLF
       aaa,bbb,cccCRLF
       #comment 2CRLF
       "aaa","this is CRLF
       # not a comment","ccc"CRLF

2.2.  Default charset and line break values

   Since the initial publication of [RFC4180], the default charset for
   "text/*" media types has been changed to UTF-8 (as per [RFC6657] and
   [RFC7111]). This document reflects this change and the default
   charset for CSV files is now UTF-8.

   Although Section 4.1.1 of [RFC2046] defines CRLF to denote line
   breaks, implementers MAY recognize a single CR or LF as a line break
   (similar to Section 3.1.1.3 of [RFC7231]). However, some
   implementations MAY use other values.

2.3.  ABNF Grammar

   The ABNF grammar (as per [RFC5234]) appears as follows:

   file = *((comment / record) linebreak)

   comment = hash *comment-data

   record = first-field *(comma field)

   linebreak = CR / LF / CRLF

   first-field = (escaped / first-non-escaped)

   field = (escaped / non-escaped)

   escaped = DQUOTE *(data-with-hash / comma / CR / LF / 2DQUOTE) DQUOTE

   first-non-escaped = [data *data-with-hash]

   non-escaped = *data-with-hash

   comma = %x2C

   hash = %x23

   comment-data = WSP / %x21-7E / UTF8-data
       ; characters without control characters

   data = WSP / %x21 / %x24-2B / %x2D-7E / UTF8-data
       ; characters without control characters, comma, hash and DQUOTE

   data-with-hash = data / hash

   CR = %x0D ; as per section B.1 of [RFC5234]

   DQUOTE = %x22 ; as per section B.1 of [RFC5234]

   LF = %x0A ; as per section B.1 of [RFC5234]

   CRLF = CR LF ; as per section B.1 of [RFC5234]

   HTAB = %x09 ; as per section B.1 of [RFC5234]

   SP = %x20 ; as per section B.1 of [RFC5234]

   WSP = SP / HTAB ; as per section B.1 of [RFC5234]

   UTF8-data = UTF8-2 / UTF8-3 / UTF8-4 ; as per section 4 of [RFC3629]

   Note that the authoritative definition of UTF-8 is in [UNICODE].

3.  Common implementation concerns

   This section describes some common concerns that may arise when
   producing or parsing CSV files. All of these remain out of scope for
   this document and are included for awareness. Implementers may also
   use other means to handle these use cases such as [CSVW].

3.1.  Null values

   Some implementations (such as databases) treat empty fields and null
   values differently. For these implementations, there is a need to
   define a special value representing a null.

   Example of a CSV file with nulls (if "NULL" is used to mark nulls):

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   zzz,NULL,xxxCRLF

3.2.  Empty files

   Implementers should be aware that in accordance with this
   specification a file does not need to contain any comments or records
   (empty file with zero bytes).

3.3.  Empty lines

   This specification recommends but doesn't require having the same
   number of fields in every line. This allows CSV files to have empty
   lines without any fields at all. Some implementations can be
   configured to skip empty lines instead of parsing them.

   Example of a CSV file with empty lines:

   field_name_1,field_name_2,field_name_3CRLF
   aaa,bbb,cccCRLF
   CRLF
   zzz,yyy,xxxCRLF

   However, if the records are only made up of one field it is not
   possible to differentiate between an empty line and an empty,
   unquoted field. This differentiation might play an important role in
   some implementations such as database exports/imports.

   Example of a CSV file with empty lines and only one field per record:

   aaa
   CRLF
   bbbCRLF

3.4.  Fields spanning multiple lines

   When quoted fields are used, a field can span multiple lines because
   line breaks may appear within such a field.

3.5.  Unique header names

   Implementers should be aware that some applications may treat header
   values as unique (either case-sensitive or case-insensitive).

3.6.  Whitespace outside of quoted fields

   When quoted fields are used, this document does not allow whitespace
   between double quotes and commas. Implementers should be aware that
   some applications may be more lenient and allow whitespace outside
   the double quotes.

3.7.  Other field separators

   This document defines a comma as a field separator but implementers
   should be aware that some applications may use different values,
   especially with non-English languages. Those are outside the scope
   of this document and implementers should consult other efforts such
   as [CSVW].

3.8.  Escaping double quotes

   This document prescribes that a double quote appearing inside a field
   must be escaped by preceding it with another double quote.
   Implementers should be aware that some applications may choose to use
   a different escaping mechanism.

3.9.  BOM header

   Applications that create text files with Unicode character encoding
   might write a BOM (byte order mark) header in order to support
   multiple Unicode encodings (like UTF-16 and UTF-32). Some
   applications might be able to read and properly interpret such a
   header; others could break. Implementers should review Section 6 of
   [RFC3629] and Section 23.8 of [UNICODE].

4.  Update to MIME Type Registration of text/csv

   The media type registration of "text/csv" should be updated as per
   the specific fields below:

   Encoding considerations:

      CSV MIME entities can consist of binary data as per Section 4.8 of
      [RFC6838]. Although Section 4.1.1 of [RFC2046] defines CRLF to
      denote line breaks, implementers MAY recognize a single CR or LF
      as a line break (similar to Section 3.1.1.3 of [RFC7231]).
      However, some implementations may use other values.

   Published specification:

      While numerous private specifications exist for various programs
      and systems, there is no single "master" specification for this
      format. An attempt at a common definition can be found in
      [RFC4180] and this document. Implementers should note that both
      documents are informational in nature and are not standards.

   Optional parameters: charset

      The "charset" parameter specifies the charset employed by the CSV
      content. In accordance with [RFC6657], the charset parameter
      SHOULD be used, and if it is not present, UTF-8 SHOULD be assumed
      as the default (this implies that US-ASCII CSV will work, even
      when not specifying the "charset" parameter).
      Any charset defined by IANA for the "text" tree may be used in
      conjunction with the "charset" parameter.

   Security considerations:

      Text/csv consists of nothing but passive text data that should not
      pose any direct risks. However, it is possible that malicious
      data may be included in order to exploit buffer overruns or other
      bugs in the program processing the text/csv data.

      Implementers and users should also be aware that some software
      applications may interpret certain characters in the beginning of
      CSV fields as referring to code or formulas, thus resulting in
      malicious code execution. This is known as "CSV injection" and
      users consuming CSV files should filter out such characters.

      The text/csv format provides no confidentiality or integrity
      protection, so if such protections are needed they must be
      supplied externally.

      The fact that software implementing fragment identifiers for CSV
      and software not implementing them differs in behavior, and the
      fact that different software may show documents or fragments to
      users in different ways, can lead to misunderstandings on the part
      of users. Such misunderstandings might be exploited in a way
      similar to spoofing or phishing.

      Implementers and users of fragment identifiers for CSV text should
      also be aware of the security considerations in RFC 3986 [RFC3986]
      and RFC 3987 [RFC3987].

   Interoperability considerations:

      Due to the lack of a single specification, there are considerable
      differences among implementations. Implementers should "be
      conservative in what you do, be liberal in what you accept from
      others" ([RFC0793]) when processing CSV files. An attempt at a
      common definition can be found in Section 2.

4.1.  IANA Considerations

   IANA is directed to update the MIME type registration for "text/csv"
   as per instructions provided in Section 4 of this document and
   include a reference to this document within the registration.

5.  Security Considerations

   All security considerations discussed in Section 4 still apply.

6.  Acknowledgments

   In addition to everyone thanked previously in [RFC4180], the author
   would like to acknowledge the contributions of the following people
   to this document: Alperen Belgic, Abed BenBrahim, Damon Koach,
   Barry Leiba, Oliver Siegmar, Marco Diniz Sousa and Greg Skinner.

   A special thank you to L.T.S.

7.  References

7.1.  Normative References

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              DOI 10.17487/RFC2046, November 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4180]  Shafranovich, Y., "Common Format and MIME Type for Comma-
              Separated Values (CSV) Files", RFC 4180,
              DOI 10.17487/RFC4180, October 2005.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008.

   [RFC6657]  Melnikov, A. and J. Reschke, "Update to MIME regarding
              "charset" Parameter Handling in Textual Media Types",
              RFC 6657, DOI 10.17487/RFC6657, July 2012.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013.

   [RFC7111]  Hausenblas, M., Wilde, E., and J. Tennison, "URI Fragment
              Identifiers for the text/csv Media Type", RFC 7111,
              DOI 10.17487/RFC7111, January 2014.

   [RFC7231]  Fielding, R., Ed. and J.
              Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
              DOI 10.17487/RFC7231, June 2014.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

7.2.  Informative References

   [ART]      Raymond, E., "The Art of Unix Programming, Chapter 5",
              September 2003.

   [CREATIVYST]
              Repici, J., "HOW-TO: The Comma Separated Value (CSV) File
              Format", 2010.

   [CSVW]     W3C, "CSV on the Web Working Group", 2016.

   [EDOCEO]   Edoceo, Inc., "Comma Separated Values (CSV) Standard File
              Format", 2020.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC1796]  Huitema, C., Postel, J., and S. Crocker, "Not All RFCs are
              Standards", RFC 1796, DOI 10.17487/RFC1796, April 1995.

   [RFC2026]  Bradner, S., "The Internet Standards Process -- Revision
              3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
              January 2005.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              13.0.0", March 2020.

Appendix A.  Major changes since [RFC4180]

   *  Added a section clarifying motivation for this document and
      standards status

   *  Changing default encoding to UTF-8 and adding Unicode to the ABNF
      grammar

   *  Allowing CR, LF and CRLF for line breaks

   *  Allowing HTAB in text data

   *  Mandating a line break at the end of the last line in the file

   *  Making records and headers optional, thus allowing for an empty
      file

   *  Adding definition of commented lines

   *  Adding a section on common implementation concerns

   *  Removed "header" parameter for the MIME type since it is not used

Appendix B.  Changes since the -00 draft

   *  Added CSV injection to security considerations (#30)

   *  Added a reference to RFC 7111 (#27)

Appendix C.  Changes since the -01 draft

   *  No changes yet, refreshed to keep draft alive

Appendix D.  Note to Readers

   *Note to the RFC Editor:* Please remove this section prior to
   publication.

   Development of this draft takes place on GitHub at:
   https://github.com/nightwatchcybersecurity/rfc4180-bis

   Comments can also be sent to the ART mailing list at:
   https://www.ietf.org/mailman/listinfo/art

   Full list of changes can be viewed via the IETF document tracker:
   https://tools.ietf.org/html/draft-shafranovich-rfc4180-bis

Author's Address

   Yakov Shafranovich
   Nightwatch Cybersecurity

   Email: yakov+ietf@nightwatchcybersecurity.com
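
   Non-normative illustration: the quoting and escaping rules in
   Section 2.1 (items 2 and 5 to 7) can be sketched as a small CSV
   writer. The Python sketch below is not part of this specification.
   The function names encode_field and encode_record are hypothetical
   and chosen only for illustration, and the sketch assumes CRLF as the
   line break unless the caller passes another value.

```python
from __future__ import annotations


def encode_field(value: str, is_first: bool = False) -> str:
    """Return one CSV field, quoted only when Section 2.1 requires it.

    Hypothetical helper, not defined by the draft. A field is quoted if
    it contains a comma, a double quote, CR or LF (item 6), or if it is
    the first field of a record and starts with a hash (item 6).
    """
    needs_quotes = any(ch in value for ch in ',"\r\n') or (
        is_first and value.startswith("#")
    )
    if not needs_quotes:
        return value
    # Item 7: a double quote inside a field is escaped by doubling it.
    return '"' + value.replace('"', '""') + '"'


def encode_record(fields: list[str], linebreak: str = "\r\n") -> str:
    """Join fields with commas and end the record with a line break.

    Item 2 requires an ending line break on the last record, so every
    record produced here carries one.
    """
    encoded = [
        encode_field(field, is_first=(index == 0))
        for index, field in enumerate(fields)
    ]
    return ",".join(encoded) + linebreak


if __name__ == "__main__":
    # Show the raw bytes-like form so the CR and LF characters are visible.
    print(repr(encode_record(["aaa", 'b"bb', "ccc"])))
    print(repr(encode_record(["#aaa", "#bbb", "ccc"])))
```

   This sketch quotes only when required. A writer could instead quote
   every field, which item 5 also permits. It does not address the
   concerns in Section 3, such as the ambiguity between an empty line
   and a single empty field.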