Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Intended status: Informational                                J. Littman
Expires: March 21, 2019                               Stanford Libraries
                                                               E. Madden
                                                     Library of Congress
                                                            J. Scancella

                                                                C. Adams
                                                     Library of Congress
                                                      September 17, 2018


                 The BagIt File Packaging Format (V1.0)
                          draft-kunze-bagit-17

Abstract

   This document describes BagIt, a set of hierarchical file layout
   conventions for storage and transfer of arbitrary digital content.  A
   "bag" has just enough structure to enclose descriptive metadata
   "tags" and a file "payload" but does not require knowledge of the
   payload's internal semantics.  This BagIt format is suitable for
   reliable storage and transfer.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 21, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Purpose
     1.2.  Requirements
     1.3.  Terminology
   2.  Structure
     2.1.  Required Elements
       2.1.1.  Bag Declaration: bagit.txt
       2.1.2.  Payload Directory: data/
       2.1.3.  Payload Manifest: manifest-algorithm.txt
     2.2.  Optional Elements
       2.2.1.  Tag Manifest: tagmanifest-algorithm.txt
       2.2.2.  Bag Metadata: bag-info.txt
       2.2.3.  Fetch File: fetch.txt
       2.2.4.  Other Tag Files
     2.3.  Text Tag File Format
     2.4.  Bag Checksum Algorithms
   3.  Complete and Valid bags
   4.  Examples
     4.1.  Example of a basic bag
     4.2.  Example bag using fetch.txt
   5.  Security Considerations
     5.1.  Special directory characters
     5.2.  Control of URLs in fetch.txt
     5.3.  File sizes in fetch.txt
     5.4.  Attacks on payload file content
   6.  Practical Considerations (non-normative)
     6.1.  Interoperability
       6.1.1.  Filename normalization
       6.1.2.  Windows and Unix file naming
       6.1.3.  Legacy checksum tools
   7.  Augmented Backus-Naur Form (non-normative)
     7.1.  Bag Declaration: bagit.txt
     7.2.  Payload Manifest: manifest-algorithm.txt
     7.3.  Bag Metadata: bag-info.txt
     7.4.  Fetch File: fetch.txt
   8.  Contributors
   9.  Acknowledgements
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

1.1.  Purpose

   BagIt is a set of hierarchical file layout conventions designed to
   support storage and transfer of arbitrary digital content.  A bag
   consists of a directory containing the payload files and other
   accompanying metadata files known as "tag" files.  The "tags" are
   metadata files intended to facilitate and document the storage and
   transfer of the bag.  Processing a bag does not require any
   understanding of the payload file contents and the payload files can
   be accessed without processing the BagIt metadata.

   The name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".
   BagIt differs from serialized archive formats such as MIME, TAR, or
   ZIP in two general areas:

   1.  Strong integrity assurances.  The format supports cryptographic-
       quality hash algorithms (see Section 2.4) and allows for in-place
       upgrades to add additional manifests using stronger algorithms
       without breaking backwards compatibility.  This provides high
       levels of confidence against data corruption but is not designed
       to be secure against active attacks.

   2.  Direct file access.  Because BagIt specifies an actual filesystem
       hierarchy rather than a serialized representation of one, files
       can be accessed using standard operating system utilities,
       implementations do not need to process a potentially large
       archive file to extract a subset of data, and the format imposes
       no size limits for either individual files or a bag.

   BagIt is widely used for preserving digital assets originating from
   different domains.  Organizations involved in digital preservation
   with BagIt include the Library of Congress, Dryad Data Repository,
   NSF DataONE, and the Rockefeller Archive Center.  Software
   implementations are available for many languages including Python,
   Ruby, Java, Perl, and PHP.  BagIt is also used in the libraries of
   many universities, such as Cornell, Purdue, Stanford, Ghent
   University, New York University, and the University of California.

1.2.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals as shown here.

   Implementers are strongly encouraged to review the interoperability
   considerations described in Section 6.1.

1.3.  Terminology

   The following terms have precise definitions as used in this
   document:

   bag  A set of opaque files contained within the structure defined by
      this document.

   bag declaration  The file required to be in all bags conforming to
      this document.  Contains values necessary to process the rest of a
      bag.  See Section 2.1.1.

   bag checksum algorithm  The name of a cryptographic checksum
      algorithm which has been normalized for use in a manifest or tag
      manifest file name (e.g. "sha512") as described in Section 2.4.

   manifest  A tag file that maps filepaths to checksums.  A manifest
      can be a payload manifest (Section 2.1.3) or a tag manifest
      (Section 2.2.1).

   payload  The data encapsulated by the bag as a set of named files,
      which may be organized in sub-directories.  The contents of the
      payload files are opaque to this document, and, with respect to
      BagIt processing, are always considered as sequences of
      uninterpreted octets.  See Section 2.1.2.

   tag directory  A directory that contains one or more tag files.

   tag file  A file which contains metadata about the bag or its
      payload.  This document defines the standard BagIt tag files: the
      bag declaration in "bagit.txt" (Section 2.1.1), payload manifests
      (Section 2.1.3), tag manifests (Section 2.2.1), bag metadata in
      "bag-info.txt" (Section 2.2.2), and remote payload in "fetch.txt"
      (Section 2.2.3).  This document also allows other arbitrary tag
      files as described in Section 2.2.4.

   complete  A bag which contains every element required by this
      document, every payload file listed in a manifest, and any
      optional files which are listed in a tag manifest.  See Section 3.

   valid  A complete bag where every checksum in every manifest has been
      successfully verified against the corresponding file.

2.  Structure

   A bag MUST consist of a base directory containing:

   1.  a set of required and optional tag files (see Section 2.2)

   2.  a sub-directory named "data", called the payload directory (see
       Section 2.1.2)

   3.  a set of optional tag directories

   The tag files in the base directory consist of one or more files
   named "manifest-_algorithm_.txt" (see Section 2.1.3 and Section 2.4),
   a file named "bagit.txt" (see Section 2.1.1), and zero or more
   additional tag files (see Section 2.2).  The tag files and
   directories are in arbitrary file hierarchies and MAY have any name
   that is not reserved for a file or directory in this document.

   The base directory can have any name.

   <base directory>/
         |
         +-- bagit.txt
         |
         +-- manifest-<algorithm>.txt
         |
         +-- [additional tag files]
         |
         +-- data/
         |     |
         |     +-- [payload files]
         |
         +-- [tag directories]/
               |
               +-- [tag files]

2.1.  Required Elements

2.1.1.  Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines in this
   order:

   BagIt-Version: M.N
   Tag-File-Character-Encoding: ENCODING

   _M.N_ identifies the BagIt major (M) and minor (N) version numbers.
   _ENCODING_ identifies the character set encoding used by the
   remaining tag files.  _ENCODING_ SHOULD be "UTF-8" but for backwards
   compatibility it MAY be any other encoding registered in
   [cs-registry].  The bag declaration itself MUST be encoded in UTF-8,
   and MUST NOT contain a byte-order mark (BOM) [RFC3629].

   The number for this version of BagIt is "1.0".

2.1.2.  Payload Directory: data/

   The base directory MUST contain a sub-directory named "data".

   The payload directory contains the arbitrary digital content within
   the bag.  The files under the payload directory are called payload
   files, or the payload.  Each payload file is treated as an opaque
   octet stream when verifying file correctness.
   Payload files MAY be organized in arbitrary sub-directory structures
   within the payload directory; however, for the purposes of this
   document such sub-directory structures and filenames have no given
   meaning.

2.1.3.  Payload Manifest: manifest-algorithm.txt

   A payload manifest file provides a complete listing of each payload
   file name along with a corresponding checksum to permit data
   integrity checking.  A bag can have more than one payload manifest,
   with each using a different checksum algorithm.  Manifest entries
   MUST satisfy the following constraints:

   o  Every bag MUST contain at least one payload manifest file and MAY
      contain more than one.

   o  Every payload manifest MUST list every payload file name exactly
      once.

   o  A payload manifest file MUST have a name of the form
      "manifest-_algorithm_.txt", where _algorithm_ is a string
      specifying the checksum algorithm used by that manifest as
      described in Section 2.4.

   Example payload manifest filenames:

   manifest-sha256.txt
   manifest-sha512.txt

   Each line of a payload manifest file MUST be of the form:

   checksum filepath

   where _filepath_ is the pathname of a file relative to the base
   directory, and _checksum_ is a hex-encoded checksum calculated
   according to _algorithm_ over every octet in the file.

   o  The hex-encoded checksum MAY use uppercase and/or lowercase
      letters.

   o  The slash character ('/') MUST be used as a path separator in
      _filepath_.

   o  One or more linear whitespace characters (spaces or tabs) MUST
      separate _checksum_ from _filepath_.

   o  There is no limitation on the length of a pathname.

   o  The payload manifest MUST NOT reference files outside the payload
      directory.

   o  If a _filepath_ includes a line feed (LF), a carriage return (CR),
      carriage return plus line feed (CRLF) or percent sign (%), those
      characters (and only those) MUST be percent-encoded following
      [RFC3986].
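
   As a non-normative illustration of the manifest line format above,
   the following Python sketch splits one payload manifest line into its
   checksum and filepath and reverses the percent-encoding of LF, CR,
   and "%".  The names parse_manifest_line and decode_filepath are
   hypothetical and are not defined by this document.  Lowercasing the
   checksum for later comparison and rejecting paths by a simple
   "data/" prefix test are assumptions of the sketch, not requirements.

```python
import re

# Only these three escapes are defined for manifest filepaths.
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}
_ESCAPE_RE = re.compile(r"%(?:0A|0D|25)", re.IGNORECASE)

# Hex checksum, one or more spaces or tabs, then the rest of the line.
_LINE_RE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+(.+)$")


def decode_filepath(encoded):
    """Reverse the percent-encoding of LF, CR and '%' in a single pass."""
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0).upper()], encoded)


def parse_manifest_line(line):
    """Return (checksum, filepath) for one payload manifest line."""
    match = _LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not of the form 'checksum filepath': %r" % line)
    checksum, encoded_path = match.groups()
    filepath = decode_filepath(encoded_path)
    # Simplistic check (assumption): a real tool must also reject
    # entries such as "data/../x" that escape the payload directory.
    if not filepath.startswith("data/"):
        raise ValueError("entry is outside the payload directory: %r" % filepath)
    # Hex digits may be upper or lower case, so normalize for comparison.
    return checksum.lower(), filepath
```

   A caller would typically apply parse_manifest_line to each line of a
   manifest and then compare the returned checksum with one computed
   over the octets of the named file.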

   A manifest MUST NOT reference directories.  Bag creators who wish to
   create an otherwise empty directory have typically done so by
   creating an empty placeholder file with a name such as ".keep".

2.2.  Optional Elements

2.2.1.  Tag Manifest: tagmanifest-algorithm.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm.

   A bag MAY contain one or more tag manifests, in which case each tag
   manifest SHOULD list the same set of tag files.

   Each tag manifest MUST list every payload manifest.  Each tag
   manifest MUST NOT list any tag manifests, but SHOULD list the
   remaining tag files present in the bag.

   A tag manifest file MUST have a name of the form
   "tagmanifest-_algorithm_.txt", where _algorithm_ is a string
   following the format described in Section 2.4 specifying the bag
   checksum algorithm used in that manifest.

   Tag manifests SHOULD use the same algorithms as the payload manifests
   that are present in the bag.

   Example tag manifest filenames:

   tagmanifest-sha256.txt
   tagmanifest-sha512.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.  As
   a result, no _filepath_ listed in a tag manifest begins with "data/".

2.2.2.  Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload.  The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human use.  All
   metadata elements are OPTIONAL and MAY be repeated.  Because
   "bag-info.txt" is intended for human reading and editing, ordering
   MAY be significant and the ordering of metadata elements MUST be
   preserved.

   A metadata element MUST consist of a label, a colon ":", a single
   linear whitespace character (space or tab), and a value, terminated
   with a line feed (LF), a carriage return (CR), or a carriage return
   plus line feed (CRLF).

   The label MUST NOT contain colons (:), line feeds (LF), or carriage
   returns (CR).  The label MAY contain linear whitespace characters,
   but MUST NOT start or end with whitespace.

   It is RECOMMENDED that lines not exceed 79 characters in length.
   Long values MAY be continued onto the next line by inserting a line
   feed (LF), a carriage return (CR), or carriage return plus line feed
   (CRLF) and indenting the next line with one or more linear whitespace
   characters (spaces or tabs).  Except for linebreaks, such padding
   does not form part of the value.

   Implementations wishing to support previous BagIt versions MUST
   accept multiple linear whitespace characters before and after the
   colon when the bag version is earlier than 1.0; such whitespace does
   not form part of the label or value.

   The following are reserved metadata elements.  The use of these
   reserved metadata elements is OPTIONAL but encouraged.  Reserved
   metadata element names are case-insensitive.  Except where indicated
   otherwise, these metadata element names MAY be repeated to capture
   multiple values.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the source organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      transfer.  This metadata element SHOULD NOT be repeated.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.  Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.
      This metadata element SHOULD NOT be repeated.

   Payload-Oxum  The "octetstream sum" of the payload, intended for the
      purpose of quickly detecting incomplete bags before performing
      checksum validation.  This is strictly an optimization and
      implementations MUST perform the standard checksum validation
      process before proclaiming a bag to be valid.  This element MUST
      NOT be present more than once and, if present, MUST be in the form
      "_OctetCount_._StreamCount_", where _OctetCount_ is the total
      number of octets (8-bit bytes) across all payload file content and
      _StreamCount_ is the total number of payload files.  This metadata
      element MUST NOT be repeated.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs.  This identifier
      SHOULD be unique across the sender's content, and if recognizable
      as belonging to a globally unique scheme, the receiver SHOULD make
      an effort to honor reference to it.  This metadata element SHOULD
      NOT be repeated.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.  This metadata element SHOULD NOT be repeated.  If this
      metadata element is present, it is RECOMMENDED to also include the
      Bag-Group-Identifier element.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local explanation of the
      contents and provenance.

   In addition to these metadata elements, other arbitrary metadata
   elements MAY also be present.

   An example "bag-info.txt" file:

   Source-Organization: FOO University
   Organization-Address: 1 Main St., Cupertino, California, 11111
   Contact-Name: Jane Doe
   Contact-Phone: +1 111-111-1111
   Contact-Email: example@example.com
   External-Description: Uncompressed greyscale TIFF images from the
        FOO papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: university_foo_001
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: university_foo
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/foo
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3.  Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  The fetch file allows a bag to be
   transmitted with "holes" in it, which can be practical for several
   reasons.  For example, it obviates the need for the sender to stage a
   large serialized copy of the content while the bag is transferred to
   the receiver.  Also, this method allows a sender to construct a bag
   from components that are either a subset of logically related
   components (e.g., the localized logical object could be much larger
   than what is intended for export) or assembled from logically
   distributed sources (e.g., the object components for export are not
   stored locally under one filesystem tree).  An OPTIONAL tag file
   called the fetch file contains such a list.

   The fetch file MUST be named "fetch.txt".  Every file listed in the
   fetch file MUST be listed in every payload manifest.  A fetch file
   MUST NOT list any tag files.
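
   The two listing rules above can be checked mechanically.  The
   following non-normative Python sketch assumes that the filepath
   values from "fetch.txt" and from each payload manifest have already
   been extracted by the caller; the function name check_fetch_entries
   and the shape of its arguments are hypothetical.  It also assumes
   that any path not beginning with "data/" refers to a tag file.

```python
def check_fetch_entries(fetch_paths, manifests):
    """Return a list of problems; an empty list means both rules hold.

    fetch_paths -- iterable of filepath values taken from fetch.txt
    manifests   -- dict mapping a payload manifest file name to the
                   set of filepaths listed in that manifest
    """
    problems = []
    for path in fetch_paths:
        # Assumption: anything outside data/ is a tag file, which a
        # fetch file is not allowed to list.
        if not path.startswith("data/"):
            problems.append(f"{path}: fetch.txt may only list payload files")
            continue
        # Every fetched file has to appear in every payload manifest.
        for name, listed in sorted(manifests.items()):
            if path not in listed:
                problems.append(f"{path}: not listed in {name}")
    return problems
```

   Returning a list of messages rather than raising on the first
   problem is a design choice of the sketch; it lets a tool report
   every inconsistent entry in one pass.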

   Each line of a fetch file MUST be of the form:

   url length filepath

   where _url_ identifies the file to be fetched and MUST be an absolute
   URI as defined in [RFC3986], _length_ is the number of octets in the
   file (or "-", to leave it unspecified), and _filepath_ identifies the
   corresponding payload file, relative to the base directory.

   The slash character ('/') MUST be used as a path separator in
   _filepath_.  One or more linear whitespace characters (spaces or
   tabs) MUST separate these three values, and any such characters in
   the _url_ MUST be percent-encoded [RFC3986].  If _filepath_ includes
   a line feed (LF), a carriage return (CR), carriage return plus line
   feed (CRLF) or percent sign (%), those characters (and only those)
   MUST be percent-encoded following [RFC3986].  There is no limitation
   on the length of any of the fields in the fetch file.

2.2.4.  Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   document.  Implementations MUST perform standard checksum validation
   on any tag file which is listed in a tag manifest but MUST otherwise
   ignore its contents.

2.3.  Text Tag File Format

   All tag files specifically described in this document MUST adhere to
   the text tag file format described below.  Other tag files MAY adhere
   to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a line feed (LF), a carriage return (CR), or carriage return plus
   line feed (CRLF).  It is RECOMMENDED that the last line in a tag file
   also end with LF, CR, or CRLF.  Text tag file names MUST end in the
   extension ".txt".

   In all text tag files except for the bag declaration file, text MUST
   use the character encoding specified in the "bagit.txt" bag
   declaration file.
   Text tag files except for the bag declaration file MAY include a
   byte-order mark (BOM) only if the specified encoding requires it for
   proper decoding.  In accordance with [RFC3629], when "bagit.txt"
   specifies UTF-8 the tag files MUST NOT begin with a byte-order mark
   (BOM).  See Section 2.1.1.

   The use of UTF-8 for text tag files is strongly RECOMMENDED.  A
   future version of BagIt may disallow encodings other than UTF-8.

2.4.  Bag Checksum Algorithms

   Payload manifests and tag manifests make it possible to validate the
   integrity of the payload and tag files in a bag, using checksums
   produced by the bag checksum algorithms.  Checksum values MUST be
   encoded so as to conform to the manifest format specified in
   Section 2.1.3.  However, the internal details of a checksum are
   outside the scope of this document.

   To avoid future ambiguity, the checksum algorithm SHOULD be
   registered in IANA's "Named Information Hash Algorithm Registry"
   [ni-registry] according to [RFC6920], but MAY for backwards
   compatibility also be MD5 [RFC1321] or SHA-1 [RFC3174].

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.  Following is a partial
   list mapping common algorithm names to normalized names:

   o  MD5: md5

   o  SHA-1: sha1

   o  SHA-256: sha256

   o  SHA-512: sha512

   Starting with BagIt 1.0, bag creation and validation tools MUST
   support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
   enable SHA-512 by default when creating new bags.  For backwards
   compatibility implementers SHOULD support MD5 [RFC1321] and SHA-1
   [RFC3174].  Implementers are encouraged to simplify the process of
   adding additional manifests using new algorithms to streamline the
   process of in-place upgrades.

3.  Complete and Valid bags

   A _complete_ bag MUST meet the following requirements:

   1.  Every required element MUST be present (Section 2.1).

   2.  Every file listed in every tag manifest MUST be present.

   3.  Every file listed in every payload manifest MUST be present.

   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of BagIt allowed payload
       files to be listed in just one of the manifests.

   5.  Every element present MUST conform to BagIt 1.0.

   A _valid_ bag MUST meet the following requirements:

   1.  The bag MUST be _complete_.

   2.  Every checksum in every payload manifest and tag manifest has
       been successfully verified against the contents of the
       corresponding file.

4.  Examples

4.1.  Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file.  Lines of file content are shown with added parentheses to
   indicate each complete line.  For brevity this example uses MD5
   rather than the recommended SHA-512.

   myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                           )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
         |
         |   27613-h/images/q172.png
         |    (... image bytes ...                                    )
         |
         |   27613-h/images/q172.txt
         |    (... OCR text ...                                       )
         ....

4.2.  Example bag using fetch.txt

   This is the layout of a bag which expects the receiver to download
   the files listed in the payload manifests prior to validation.  Lines
   of file content are shown with added parentheses to indicate each
   complete line.  For brevity this example uses MD5 rather than the
   recommended SHA-512.

   highsmith-tahoe/
   |
   |   manifest-md5.txt
   |    (102b0e6effe208ef9b29864946de9e22 data/23364a.tif             )
   |
   |   fetch.txt
   |    (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
   |        216951362 data/23364a.tif                                 )
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                           )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   |   bag-info.txt
   |    (Internal-Sender-Description: Download link found at          )
   |    (   https://www.loc.gov/resource/highsm.23364/                )

5.  Security Considerations

5.1.  Special directory characters

   The paths specified in the payload manifests, tag manifests, and
   fetch files are not prohibited from containing directory characters
   that have special meaning on some operating systems.  Implementers
   MUST ensure that files outside the bag directory structure are not
   accessed when reading or writing files based on paths specified in a
   bag.

   All implementations SHOULD have a test suite to guard against special
   directory characters.

   For example, a maliciously crafted "tagmanifest-sha512.txt" file
   might contain entries which begin with a path character such as "/",
   "..", or a "~username" home directory reference in an attempt to
   cause a naive implementation to leak or overwrite targeted files on a
   POSIX operating system.

   Windows implementations SHOULD be tested to ensure that safety checks
   prevent use of drive letters and the less commonly used namespace
   sequences (e.g. "\\?\C:\...") described in [MSFNAM].

   To assist implementers, the Library of Congress conformance suite
   [LC-CONFORMANCE-SUITE] has some tests for invalid bags which are
   expected to fail on POSIX or Windows clients.

5.2.  Control of URLs in fetch.txt

   Implementers of tools that complete bags by retrieving URLs listed in
   a fetch file need to be aware that some of those URLs might point to
   hosts, intentionally or unintentionally, that are not under control
   of the bag's sender.
   Moreover, older checksum algorithms, even if reasonable for detecting
   corruption during transit, may not offer strong cryptographic
   protection against intentional spoofing.

5.3.  File sizes in fetch.txt

   The size of files, as optionally reported in the fetch file, cannot
   be guaranteed to match the actual file size to be downloaded.
   Implementers SHOULD monitor each transfer and abort it when the
   received file size exceeds the file size reported in the fetch file.
   Implementers SHOULD NOT use the file size in the fetch file for
   critical resource allocation, such as buffer sizing or storage
   requisitioning.

5.4.  Attacks on payload file content

   The integrity assurance provided by manifests is designed to provide
   high levels of confidence against data corruption but is not designed
   to be secure against active attacks.  Organizations that need to
   secure bags against such threats SHOULD agree on additional measures,
   such as digital signatures, that are out of scope for this
   specification.

6.  Practical Considerations (non-normative)

6.1.  Interoperability

   This section lists practical considerations for implementers and
   users.  None of the points below are required but they are
   recommended for general-purpose usage.

   Upon discovering errors in bags, an implementation is free to take
   action (for example, logging or reporting) in an application-specific
   manner.  This document does not mandate any particular action.

   The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is
   provided as a public resource to test new implementations for
   compatibility and error handling.

6.1.1.  Filename normalization

   This section provides background information on various challenges
   caused by differences in how operating systems, filesystems, and
   common tools handle filenames, followed by a list of recommendations
   for implementers in Section 6.1.1.3.

6.1.1.1.  Case sensitivity

   There are three challenges for interoperability related to filename
   case:

   o  Filesystems such as FAT or EXFAT always convert filenames to
      uppercase: "example.txt" will be stored as "EXAMPLE.TXT".

   o  Many Unix filesystems save filenames exactly as provided, allowing
      multiple files which differ only in case: "example.txt" and
      "Example.txt" are separate files.

   o  NTFS and Apple's HFS Plus usually preserve case when storing files
      but are case-insensitive when retrieving them. A file saved as
      "Example.txt" will be retrieved by that name but will also be
      retrieved as "EXAMPLE.TXT", "example.txt", etc.

6.1.1.2.  Unicode normalization

   The Unicode specification has common cases where different character
   sequences produce the same human-meaningful text. These are referred
   to as "canonically equivalent" and the Unicode specification defines
   different normalization forms -- see [UNICODE-TR15] for the full
   details and a brief example below:

   The common surname "Nunez" normalized in different forms

   Normalization Form D (Decomposition):

   Char    UTF8 Hex  Name
   ----------------------------------------------
   N       4e        LATIN CAPITAL LETTER N
   u       75        LATIN SMALL LETTER U
   \u0301  cc81      COMBINING ACUTE ACCENT
   n       6e        LATIN SMALL LETTER N
   \u0303  cc83      COMBINING TILDE
   e       65        LATIN SMALL LETTER E
   z       7a        LATIN SMALL LETTER Z

   Normalization Form C (Canonical Composition):

   Char    UTF8 Hex  Name
   ----------------------------------------------
   N       4e        LATIN CAPITAL LETTER N
   u       c3ba      LATIN SMALL LETTER U WITH ACUTE
   n       c3b1      LATIN SMALL LETTER N WITH TILDE
   e       65        LATIN SMALL LETTER E
   z       7a        LATIN SMALL LETTER Z

   Unicode normalization is relevant to BagIt implementers because
   different systems have different standards for normalization:

   o  Apple's HFS Plus filesystem always normalizes filenames to a
      fully-decomposed form based on the Unicode 2.0 specification (see
      [TN1150]).

   o  Windows treats filenames as opaque character sequences (see
      [MSFNAM]) and will store and return the encoded bytes exactly as
      provided.

   o  Linux and other common Unix systems are generally similar to
      Windows in storing and returning opaque byte streams but this
      behavior is technically filesystem-dependent.

   o  Utilities used for file management, transfer, and archival may
      ignore this issue, apply an arbitrary normalization form, or allow
      the user to control how normalization is applied.

   In practice, this means that the encoded filename stored in a
   manifest may fail a simple file existence check because the
   filename's normalization was changed at some point after the manifest
   was written. This situation is very confusing for users because the
   filenames are visually indistinguishable and the "missing" file is
   obviously present in the payload directory.

6.1.1.3.  Recommendations

   o  Implementations SHOULD discourage the creation of bags containing
      files which differ only in case.

   o  Implementations SHOULD prevent the creation of bags containing
      files which differ only in normalization form.

   o  BagIt implementations SHOULD tolerate differences in normalization
      form by comparing the list of filesystem names with the list of
      manifest names after applying the same normalization form to
      both.

   o  Implementations SHOULD issue a warning when multiple manifests are
      present which differ only in case or normalization form.

6.1.2.  Windows and Unix file naming

   As specified above, only the Unix-based path separator ('/') may be
   used inside filenames listed in BagIt manifest and fetch.txt files.
   When bags are exchanged between Windows and Unix platforms, the path
   separator SHOULD be translated as needed. Receivers of bags on
   physical media SHOULD be prepared for filesystems created under
   either Windows or Unix.
   Besides the fundamental difference between path separators ('\' and
   '/'), generally, Windows filesystems have more limitations than Unix
   filesystems.

   Windows path names have a maximum of 255 characters, and none of
   these characters may be used in a path component:

   < > : " / | ? *

   Windows also reserves the following names, with or without a file
   extension:

   CON, PRN, AUX, NUL
   COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
   LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9

   See [MSFNAM] for more information and possible alternatives.

6.1.3.  Legacy checksum tools

   Some bags have been manually assembled using checksum utilities such
   as those contained in the GNU Coreutils package (md5sum, sha1sum,
   etc.), collectively referred to here as "md5sum". Implementers who
   desire wide support of legacy content should be aware of some known
   quirks of these tools:

   md5sum can be run in "text mode" which causes it to normalize
   line-endings on some operating systems. On Unix-like systems both
   modes will usually produce the same results but on systems like
   Windows they can produce different results based on the file
   contents. The md5sum output format has two characters between the
   checksum and the filepath: the first is always a space and the second
   is an asterisk ("*") for binary mode and a space for text mode.

   A final note about md5sum-generated manifests is that for a
   _filepath_ containing a backslash ('\'), the manifest line will have
   a backslash inserted in front of the _checksum_ and, under Windows,
   the backslashes inside _filepath_ can be doubled.

   Implementers MAY wish to accept this format by ignoring a leading
   asterisk or handling differences in line termination gracefully but,
   if so, implementations MUST warn the user that the bag in question
   will fail strict validation.
   In such cases it is RECOMMENDED that tools provide an easy option to
   update the bag with valid manifests.

7.  Augmented Backus-Naur Form (non-normative)

   The Augmented Backus-Naur Form (ABNF) rules provided below are
   non-normative. If there is a discrepancy between requirements in the
   normative sections and the ABNF, the requirements in the normative
   sections prevail. Some definitions use the core rules (e.g., DIGIT,
   HEXDIG, etc.) as defined in [RFC4234].

7.1.  Bag Declaration: bagit.txt

   bagit.txt ABNF rules:

   bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending
               "Tag-File-Character-Encoding: " encoding ending
   encoding  = 1*CHAR
   ending    = CR / LF / CRLF

7.2.  Payload Manifest: manifest-algorithm.txt

   Payload Manifest ABNF rules:

   payload-manifest      = 1*payload-manifest-line
   payload-manifest-line = checksum 1*WSP filepath ending
   checksum              = 1*case-hexdig
   case-hexdig           = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" /
                           "D" / "d" / "E" / "e" / "F" / "f"
   filepath              = "data/"
                           1*( unreserved / pct-encoded / sub-delims )
   unreserved            = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims            = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
                           "*" / "+" / "," / ";" / "=" / "/"
   pct-encoded           = "%0D" / "%0d" / "%0A" / "%0a" / "%25"
   ending                = CR / LF / CRLF

7.3.  Bag Metadata: bag-info.txt

   bag-info.txt ABNF rules:

   metadata      = 1*metadata-line
   metadata-line = key ":" WSP value ending *(continuation ending)
   key           = 1*non-reserved
   value         = 1*non-reserved
   continuation  = WSP 1*non-reserved
   non-reserved  = VCHAR / WSP
                   ; any valid character for the specific encoding
                   ; except those that match "ending"
   ending        = CR / LF / CRLF

7.4.  Fetch File: fetch.txt

   fetch.txt ABNF rules:

   fetch      = 1*fetch-line
   fetch-line = url 1*WSP length 1*WSP filepath ending
   url        =
   length     = 1*DIGIT / "-"
   filepath   = ("data/"
                 1*( unreserved / pct-encoded / sub-delims ))
   ending     = CR / LF / CRLF

8.  Contributors

   Additional contributors to the authoring of BagIt are Andy Boyko,
   David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate
   Zwaard.

9.  Acknowledgements

   BagIt benefitted from the thoughtful assistance of Stephen Abrams,
   Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad
   Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark
   Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam
   Turoff, and Jim Tuttle.

10.  IANA Considerations

   This draft does not request any action from IANA.

11.  References

11.1.  Normative References

   [cs-registry]
              IANA, "Character Set Registry", December 2013.

   [ni-registry]
              IANA, "Named Information Hash Algorithm Registry",
              September 2016.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              DOI 10.17487/RFC1321, April 1992.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3174]  Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC6234]  Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
              (SHA and SHA-based HMAC and HKDF)", RFC 6234,
              DOI 10.17487/RFC6234, May 2011.

   [RFC6920]  Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B.,
              Keranen, A., and P. Hallam-Baker, "Naming Things with
              Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

11.2.  Informative References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [LC-CONFORMANCE-SUITE]
              The Library of Congress, "BagIt Conformance Suite", 2016-.

   [MSFNAM]   Microsoft, Inc., "Naming a File", 2008.

   [RFC4234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, DOI 10.17487/RFC4234,
              October 2005.

   [TN1150]   Apple Inc., "Technical Note TN1150: HFS Plus Volume
              Format", March 2004.

   [UNICODE-TR15]
              Unicode Consortium, "Unicode(R) Standard Annex #15:
              Unicode Normalization Forms", February 2016.

Authors' Addresses

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   US

   Email: jak@ucop.edu

   Justin Littman
   Stanford Libraries
   518 Memorial Way
   Stanford, CA 94305
   USA

   Email: justinlittman@stanford.edu

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: emad@loc.gov

   John Scancella

   Email: john.scancella@gmail.com

   Chris Adams
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: cadams@loc.gov