Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Intended status: Informational                                J. Littman
Expires: December 6, 2018         George Washington University Libraries
                                                               E. Madden
                                                            J. Scancella
                                                                C. Adams
                                                     Library of Congress
                                                            June 4, 2018

                 The BagIt File Packaging Format (V1.0)
                          draft-kunze-bagit-16

Abstract

   This document describes BagIt, a set of hierarchical file layout
   conventions for storage and transfer of arbitrary digital content.  A
   "bag" has just enough structure to enclose descriptive metadata
   "tags" and a file "payload" but does not require knowledge of the
   payload's internal semantics.  This BagIt format is suitable for
   reliable storage and transfer.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 6, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Purpose
     1.2.  Requirements
     1.3.  Terminology
   2.  Structure
     2.1.  Required Elements
       2.1.1.  Bag Declaration: bagit.txt
       2.1.2.  Payload Directory: data/
       2.1.3.  Payload Manifest: manifest-algorithm.txt
     2.2.  Optional Elements
       2.2.1.  Tag Manifest: tagmanifest-algorithm.txt
       2.2.2.  Bag Metadata: bag-info.txt
       2.2.3.  Fetch File: fetch.txt
       2.2.4.  Other Tag Files
     2.3.  Text Tag File Format
     2.4.  Bag Checksum Algorithms
   3.  Complete and Valid bags
   4.  Examples
     4.1.  Example of a basic bag
     4.2.  Example bag using fetch.txt
   5.  Security Considerations
     5.1.  Special directory characters
     5.2.  Control of URLs in fetch.txt
     5.3.  File sizes in fetch.txt
   6.  Practical Considerations (non-normative)
     6.1.  Interoperability
       6.1.1.  Filename normalization
       6.1.2.  Windows and Unix file naming
       6.1.3.  Legacy checksum tools
   7.  Augmented Backus-Naur Form (non-normative)
     7.1.  Bag Declaration: bagit.txt
     7.2.  Payload Manifest: manifest-algorithm.txt
     7.3.  Bag Metadata: bag-info.txt
     7.4.  Fetch File: fetch.txt
   8.  Contributors
   9.  Acknowledgements
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

1.1.  Purpose

   BagIt is a set of hierarchical file layout conventions designed to
   support storage and transfer of arbitrary digital content.  A bag
   consists of a directory containing the payload files and other
   accompanying metadata files known as "tag" files.  The "tags" are
   metadata files intended to facilitate and document the storage and
   transfer of the bag.  Processing a bag does not require any
   understanding of the payload file contents and the payload files can
   be accessed without processing the BagIt metadata.

   The name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".  BagIt
   differs from serialized archive formats such as MIME, TAR, or ZIP in
   two general areas:

   1.  Strong integrity assurances.
       The format supports cryptographic-quality hash algorithms (see
       Section 2.4) and allows for in-place upgrades to add additional
       manifests using stronger algorithms without breaking backwards
       compatibility.

   2.  Direct file access.  Because BagIt specifies an actual filesystem
       hierarchy rather than a serialized representation of one, files
       can be accessed using standard operating system utilities,
       implementations do not need to process a potentially large
       archive file to extract a subset of data, and the format imposes
       no size limits for either individual files or a bag.

   BagIt is widely used for preserving digital assets originating from
   different domains.  Organizations involved in digital preservation
   with BagIt include the Library of Congress, Dryad Data Repository,
   NSF DataONE, and the Rockefeller Archive Center.  Software
   implementations are available for many languages including Python,
   Ruby, Java, Perl, and PHP.  It is also used in the libraries of many
   universities, such as Cornell, Purdue, Stanford, Ghent University,
   New York University, and the University of California.

1.2.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals as shown here.

   Implementers are strongly encouraged to review the interoperability
   considerations described in Section 6.1.

1.3.  Terminology

   The following terms have precise definitions as used in this
   document:

   bag  A set of opaque files contained within the structure defined by
      this document.

   bag declaration  The file required to be in all bags conforming to
      this document.  Contains values necessary to process the rest of a
      bag.  See Section 2.1.1.
   bag checksum algorithm  The name of a cryptographic checksum
      algorithm which has been normalized for use in a manifest or tag
      manifest file name (e.g. "sha512") as described in Section 2.4.

   manifest  A tag file that maps filepaths to checksums.  A manifest
      can be a payload manifest (Section 2.1.3) or a tag manifest
      (Section 2.2.1).

   payload  The data encapsulated by the bag as a set of named files,
      which may be organized in sub-directories.  The contents of the
      payload files are opaque to this document, and, with respect to
      BagIt processing, are always considered as sequences of
      uninterpreted octets.  See Section 2.1.2.

   tag directory  A directory that contains one or more tag files.

   tag file  A file which contains metadata about the bag or its
      payload.  This document defines the standard BagIt tag files: the
      bag declaration in "bagit.txt" (Section 2.1.1), payload manifests
      (Section 2.1.3), tag manifests (Section 2.2.1), bag metadata in
      "bag-info.txt" (Section 2.2.2), and remote payload in "fetch.txt"
      (Section 2.2.3).  This document also allows other arbitrary tag
      files as described in Section 2.2.4.

   complete  A bag which contains every element required by this
      document, every payload file listed in a manifest, and any
      optional files which are listed in a tag manifest.  See Section 3.

   valid  A complete bag where every checksum in every manifest has been
      successfully verified against the corresponding file.

2.  Structure

   A bag MUST consist of a base directory containing:

   1.  a set of required and optional tag files (Section 2.2)

   2.  a sub-directory named "data", called the payload directory
       (Section 2.1.2)

   3.  a set of optional tag directories

   The tag files in the base directory consist of one or more files
   named "manifest-_algorithm_.txt" (see Section 2.1.3 and Section 2.4),
   a file named "bagit.txt" (see Section 2.1.1), and zero or more
   additional tag files (see Section 2.2).  The tag files and
   directories are in arbitrary file hierarchies and MAY have any name
   that is not reserved for a file or directory in this document.

   The base directory can have any name.

   <base directory>/
   |
   +-- bagit.txt
   |
   +-- manifest-<algorithm>.txt
   |
   +-- [additional tag files]
   |
   +-- data/
   |     |
   |     +-- [payload files]
   |
   +-- [tag directories]/
         |
         +-- [tag files]

2.1.  Required Elements

2.1.1.  Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines in this
   order:

   BagIt-Version: M.N
   Tag-File-Character-Encoding: ENCODING

   _M.N_ identifies the BagIt major (M) and minor (N) version numbers.
   _ENCODING_ identifies the character set encoding used by the
   remaining tag files.  _ENCODING_ SHOULD be "UTF-8" but for backwards
   compatibility it MAY be any other encoding registered in [RFC2978].
   The bag declaration itself MUST be encoded in UTF-8, and MUST NOT
   contain a byte-order mark (BOM) [RFC3629].

   The number for this version of BagIt is "1.0".

2.1.2.  Payload Directory: data/

   The base directory MUST contain a sub-directory named "data".

   The payload directory contains the arbitrary digital content within
   the bag.  The files under the payload directory are called payload
   files, or the payload.  Each payload file is treated as an opaque
   octet stream when verifying file correctness.  Payload files MAY be
   organized in arbitrary sub-directory structures within the payload
   directory; however, for the purpose of this document such sub-
   directory structures and filenames have no given meaning.

2.1.3.  Payload Manifest: manifest-algorithm.txt

   A payload manifest file provides a complete listing of each payload
   file name along with a corresponding checksum to permit data
   integrity checking.  A bag can have more than one payload manifest,
   with each using a different checksum algorithm.  Manifest entries
   MUST satisfy the following constraints:

   o  Every bag MUST contain at least one payload manifest file and MAY
      contain more than one.

   o  Every payload manifest MUST list every payload file name exactly
      once.

   o  A payload manifest file MUST have a name of the form "manifest-
      _algorithm_.txt", where _algorithm_ is a string specifying the
      checksum algorithm used by that manifest as described in
      Section 2.4.

   Example payload manifest filenames:

   manifest-sha256.txt
   manifest-sha512.txt

   Each line of a payload manifest file MUST be of the form:

   checksum filepath

   where _filepath_ is the pathname of a file relative to the base
   directory, and _checksum_ is a hex-encoded checksum calculated
   according to _algorithm_ over every octet in the file.

   o  The hex-encoded checksum MAY use uppercase and/or lowercase
      letters.

   o  The slash character ('/') MUST be used as a path separator in
      _filepath_.

   o  One or more linear whitespace characters (spaces or tabs) MUST
      separate _checksum_ from _filepath_.

   o  There is no limitation on the length of a pathname.

   o  The payload manifest MUST NOT reference files outside the payload
      directory.

   o  If a _filepath_ includes a line feed (LF), a carriage return (CR),
      carriage return plus line feed (CRLF) or percent sign (%), those
      characters (and only those) MUST be percent-encoded following
      [RFC3986].

   A manifest MUST NOT reference directories.  Bag creators who wish to
   create an otherwise empty directory have typically done so by
   creating an empty placeholder file with a name such as ".keep".

2.2.  Optional Elements

2.2.1.  Tag Manifest: tagmanifest-algorithm.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm.

   A bag MAY contain one or more tag manifests, in which case each tag
   manifest SHOULD list the same set of tag files.

   Each tag manifest MUST list every payload manifest.  Each tag
   manifest MUST NOT list any tag manifests, but SHOULD list the
   remaining tag files present in the bag.

   A tag manifest file MUST have a name of the form "tagmanifest-
   _algorithm_.txt", where _algorithm_ is a string following the format
   described in Section 2.4 specifying the bag checksum algorithm used
   in that manifest.

   Tag manifests SHOULD use the same algorithms as the payload manifests
   that are present in the bag.

   Example tag manifest filenames:

   tagmanifest-sha256.txt
   tagmanifest-sha512.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.  As
   a result, no _filepath_ listed in a tag manifest begins with "data/".

2.2.2.  Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload.  The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human use.  All
   metadata elements are OPTIONAL and MAY be repeated.  Because "bag-
   info.txt" is intended for human reading and editing, ordering MAY be
   significant and the ordering of metadata elements MUST be preserved.

   A metadata element MUST consist of a label, a colon ":", a single
   linear whitespace character (space or tab), and a value, terminated
   with a line feed (LF), a carriage return (CR), or a carriage return
   plus line feed (CRLF).

   The label MUST NOT contain a colon (:), line feed (LF), or carriage
   return (CR).
   The label MAY contain linear whitespace characters,
   but MUST NOT start or end with whitespace.

   It is RECOMMENDED that lines not exceed 79 characters in length.
   Long values MAY be continued onto the next line by inserting a line
   feed (LF), a carriage return (CR), or carriage return plus line feed
   (CRLF) and indenting the next line with one or more linear whitespace
   characters (spaces or tabs).  Except for linebreaks, such padding
   does not form part of the value.

   Implementations wishing to support previous BagIt versions MUST
   accept multiple linear whitespace characters before and after the
   colon when the bag version is earlier than 1.0; such whitespace does
   not form part of the label or value.

   The following are reserved metadata elements.  The use of these
   reserved metadata elements is OPTIONAL but encouraged.  Reserved
   metadata element names are case-insensitive.  Except where indicated
   otherwise, these metadata element names MAY be repeated to capture
   multiple values.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the source organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      transfer.  This metadata element SHOULD NOT be repeated.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.  Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.
      This metadata element SHOULD NOT be repeated.

   Payload-Oxum  The "octetstream sum" of the payload, intended for the
      purpose of quickly detecting incomplete bags before performing
      checksum validation.  This is strictly an optimization and
      implementations MUST perform the standard checksum validation
      process before proclaiming a bag to be valid.  This element MUST
      NOT be present more than once and, if present, MUST be in the form
      "_OctetCount_._StreamCount_", where _OctetCount_ is the total
      number of octets (8-bit bytes) across all payload file content and
      _StreamCount_ is the total number of payload files.  This metadata
      element MUST NOT be repeated.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs.  This identifier
      SHOULD be unique across the sender's content, and if recognizable
      as belonging to a globally unique scheme, the receiver SHOULD make
      an effort to honor reference to it.  This metadata element SHOULD
      NOT be repeated.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.  This metadata element SHOULD NOT be repeated.  If this
      metadata element is present, it is RECOMMENDED to also include the
      Bag-Group-Identifier element.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local explanation of the
      contents and provenance.

   In addition to these metadata elements, other arbitrary metadata
   elements MAY also be present.
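
   The Payload-Oxum value described above can be computed with a short
   directory walk.  The following is a minimal sketch in Python, not
   part of the specification.  The function name "payload_oxum" is
   hypothetical, and the sketch assumes that every payload file is
   already present under "data/" on the local filesystem (that is, any
   files listed in "fetch.txt" have already been retrieved).

```python
import os


def payload_oxum(bag_dir):
    """Return "OctetCount.StreamCount" for the payload under bag_dir/data.

    Hypothetical helper for illustration only. Tag files outside the
    payload directory are not counted.
    """
    octet_count = 0   # total number of octets across all payload files
    stream_count = 0  # total number of payload files
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```

   A receiver could compare this string with the Payload-Oxum value in
   "bag-info.txt" as a quick completeness check.  A match does not
   replace checksum validation, which is still required before a bag is
   declared valid.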

   An example "bag-info.txt" file:

   Source-Organization: FOO University
   Organization-Address: 1 Main St., Cupertino, California, 11111
   Contact-Name: Jane Doe
   Contact-Phone: +1 111-111-1111
   Contact-Email: example@example.com
   External-Description: Uncompressed greyscale TIFF images from the
        FOO papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: university_foo_001
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: university_foo
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/foo
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3.  Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  The fetch file allows a bag to be
   transmitted with "holes" in it, which can be practical for several
   reasons.  For example, it obviates the need for the sender to stage a
   large serialized copy of the content while the bag is transferred to
   the receiver.  Also, this method allows a sender to construct a bag
   from components that are either a subset of logically related
   components (e.g., the localized logical object could be much larger
   than what is intended for export) or assembled from logically
   distributed sources (e.g., the object components for export are not
   stored locally under one filesystem tree).  An OPTIONAL tag file
   called the fetch file contains such a list.

   The fetch file MUST be named "fetch.txt".  Every file listed in the
   fetch file MUST be listed in every payload manifest.  A fetch file
   MUST NOT list any tag files.

   Each line of a fetch file MUST be of the form:

   url length filepath

   where _url_ identifies the file to be fetched and MUST be an absolute
   URI as defined in [RFC3986], _length_ is the number of octets in the
   file (or "-", to leave it unspecified), and _filepath_ identifies the
   corresponding payload file, relative to the base directory.

   The slash character ('/') MUST be used as a path separator in
   _filepath_.  One or more linear whitespace characters (spaces or
   tabs) MUST separate these three values, and any such characters in
   the _url_ MUST be percent-encoded [RFC3986].  If _filepath_ includes
   a line feed (LF), a carriage return (CR), carriage return plus line
   feed (CRLF) or percent sign (%), those characters (and only those)
   MUST be percent-encoded following [RFC3986].  There is no limitation
   on the length of any of the fields in the fetch file.

2.2.4.  Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   document.  Implementations MUST perform standard checksum validation
   on any tag file which is listed in a tag manifest but MUST otherwise
   ignore their contents.

2.3.  Text Tag File Format

   All tag files specifically described in this document MUST adhere to
   the text tag file format described below.  Other tag files MAY adhere
   to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a line feed (LF), a carriage return (CR), or carriage return plus
   line feed (CRLF).  It is RECOMMENDED that the last line in a tag file
   also ends with LF, CR, or CRLF.  Text tag file names MUST end in the
   extension ".txt".

   In all text tag files except for the bag declaration file, text MUST
   use the character encoding specified in the "bagit.txt" bag
   declaration file.
   Text tag files except for the bag declaration file
   MAY include a byte-order mark (BOM) only if the specified encoding
   requires it for proper decoding.  In accordance with [RFC3629], when
   "bagit.txt" specifies UTF-8 the tag files MUST NOT begin with a byte-
   order mark (BOM).  See Section 2.1.1.

   The use of UTF-8 for text tag files is strongly RECOMMENDED.  A
   future version of BagIt may disallow encodings other than UTF-8.

2.4.  Bag Checksum Algorithms

   The payload manifests and tag manifests allow the integrity of the
   payload and tag files in a bag to be validated using checksums
   produced by the bag checksum algorithms.  Checksum values MUST be
   encoded so as to conform to the manifest format specified in
   Section 2.1.3.  However, the internal details of a checksum are
   outside the scope of this document.

   To avoid future ambiguity, the checksum algorithm SHOULD be
   registered in IANA's "Named Information Hash Algorithm Registry"
   [ni-registry] according to [RFC6920], but MAY for backwards
   compatibility also be MD5 [RFC1321] or SHA-1 [RFC3174].

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.  Following is a partial
   list mapping common algorithm names to normalized names:

   o  MD5: md5

   o  SHA-1: sha1

   o  SHA-256: sha256

   o  SHA-512: sha512

   Starting with BagIt 1.0, bag creation and validation tools MUST
   support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
   enable SHA-512 by default when creating new bags.  For backwards
   compatibility implementers SHOULD support MD5 [RFC1321] and SHA-1
   [RFC3174].  Implementers are encouraged to simplify the process of
   adding additional manifests using new algorithms to streamline the
   process of in-place upgrades.

3.  Complete and Valid bags

   A _complete_ bag MUST meet the following requirements:

   1.  Every required element MUST be present (Section 2.1).

   2.  Every file listed in every tag manifest MUST be present.

   3.  Every file listed in every payload manifest MUST be present.

   4.  For BagIt 1.0, every payload file MUST be listed in every payload
       manifest.  Note that older versions of BagIt allowed payload
       files to be listed in just one of the manifests.

   5.  Every element present MUST conform to BagIt 1.0.

   A _valid_ bag MUST meet the following requirements:

   1.  The bag MUST be _complete_.

   2.  Every checksum in every payload manifest and tag manifest has
       been successfully verified against the contents of the
       corresponding file.

4.  Examples

4.1.  Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file.  Lines of file content are shown with added parentheses to
   indicate each complete line.  For brevity this example uses MD5
   rather than the recommended SHA-512.

   myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                            )
   |    (Tag-File-Character-Encoding: UTF-8                            )
   |
   \--- data/
         |
         |   27613-h/images/q172.png
         |    (... image bytes ...                                     )
         |
         |   27613-h/images/q172.txt
         |    (... OCR text ...                                        )
         ....

4.2.  Example bag using fetch.txt

   This is the layout of a bag which expects the receiver to download
   the files listed in the payload manifests prior to validation.  Lines
   of file content are shown with added parentheses to indicate each
   complete line.  For brevity this example uses MD5 rather than the
   recommended SHA-512.

   highsmith-tahoe/
   |
   |   manifest-md5.txt
   |    (102b0e6effe208ef9b29864946de9e22 data/23364a.tif              )
   |
   |   fetch.txt
   |    (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
   |      216951362 data/23364a.tif                                    )
   |
   |   bagit.txt
   |    (BagIt-Version: 1.0                                            )
   |    (Tag-File-Character-Encoding: UTF-8                            )
   |
   |   bag-info.txt
   |    (Internal-Sender-Description: Download link found at           )
   |    ( https://www.loc.gov/resource/highsm.23364/                   )

5.  Security Considerations

5.1.  Special directory characters

   The paths specified in the payload manifests, tag manifests, and
   fetch files do not prohibit special directory characters which have
   special meaning on some operating systems.  Implementers MUST ensure
   that files outside the bag directory structure are not accessed when
   reading or writing files based on paths specified in a bag.

   All implementations SHOULD have a test suite to guard against special
   directory characters.

   For example, a maliciously crafted "tagmanifest-sha512.txt" file
   might contain entries which begin with a path character such as "/",
   "..", or a "~username" home directory reference in an attempt to
   cause a naive implementation to leak or overwrite targeted files on a
   POSIX operating system.

   Windows implementations SHOULD test their implementations to ensure
   that safety-checks prevent use of drive letters and the less commonly
   used namespace sequences (e.g. "\\?\C:\...") described in [MSFNAM].

   To assist implementers, the Library of Congress conformance suite
   [LC-CONFORMANCE-SUITE] has some tests for invalid bags which are
   expected to fail on POSIX or Windows clients.

5.2.  Control of URLs in fetch.txt

   Implementers of tools that complete bags by retrieving URLs listed in
   a fetch file need to be aware that some of those URLs might point to
   hosts, intentionally or unintentionally, that are not under control
   of the bag's sender.
   Moreover, older checksum algorithms, even if reasonable for
   detecting corruption during transit, may not offer strong
   cryptographic protection against intentional spoofing.

5.3.  File sizes in fetch.txt

   The size of files, as optionally reported in the fetch file, cannot
   be guaranteed to match the actual file size to be downloaded.
   Implementers SHOULD take steps to monitor and abort the transfer when
   the received file size exceeds the file size reported in the fetch
   file.  Implementers SHOULD NOT use the file size in the fetch file
   for critical resource allocation, such as buffer sizing or storage
   requisitioning.

6.  Practical Considerations (non-normative)

6.1.  Interoperability

   This section lists practical considerations for implementers and
   users.  None of the points below are required but they are
   recommended for general-purpose usage.

   Upon discovering errors in bags, an implementation is free to take
   action (for example, logging or reporting) in an application-specific
   manner.  This document does not mandate any particular action.

   The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is
   provided as a public resource to test new implementations for
   compatibility and error handling.

6.1.1.  Filename normalization

   This section provides background information on various challenges
   caused by differences in how operating systems, filesystems, and
   common tools handle filenames, followed by a list of recommendations
   for implementers in Section 6.1.1.3.

6.1.1.1.  Case sensitivity

   There are three challenges for interoperability related to filename
   case:

   o  Filesystems such as FAT or exFAT always convert filenames to
      uppercase: "example.txt" will be stored as "EXAMPLE.TXT"

   o  Many Unix filesystems save filenames exactly as provided, allowing
      multiple files which differ only in case: "example.txt" and
      "Example.txt" are separate files

   o  NTFS and Apple's HFS Plus usually preserve case when storing files
      but are case-insensitive when retrieving them.  A file saved as
      "Example.txt" will be retrieved by that name but will also be
      retrieved as "EXAMPLE.TXT", "example.txt", etc.

6.1.1.2.  Unicode normalization

   The Unicode specification has common cases where different character
   sequences produce the same human-meaningful text.  These sequences
   are referred to as "canonically equivalent", and the Unicode
   specification defines several normalization forms for them; see
   [UNICODE-TR15] for the full details.  A brief example follows:

   The common surname "Nunez" normalized in different forms

   Normalization Form D (Canonical Decomposition):

      Char     UTF8 Hex   Name
      ----------------------------------------------
      N        4e         LATIN CAPITAL LETTER N
      u        75         LATIN SMALL LETTER U
      \u0301   cc81       COMBINING ACUTE ACCENT
      n        6e         LATIN SMALL LETTER N
      \u0303   cc83       COMBINING TILDE
      e        65         LATIN SMALL LETTER E
      z        7a         LATIN SMALL LETTER Z

   Normalization Form C (Canonical Composition):

      Char     UTF8 Hex   Name
      ----------------------------------------------
      N        4e         LATIN CAPITAL LETTER N
      u        c3ba       LATIN SMALL LETTER U WITH ACUTE
      n        c3b1       LATIN SMALL LETTER N WITH TILDE
      e        65         LATIN SMALL LETTER E
      z        7a         LATIN SMALL LETTER Z

   Unicode normalization is relevant to BagIt implementers because
   different systems have different standards for normalization:

   o  Apple's HFS Plus filesystem always normalizes filenames to a
      fully-decomposed form based on the Unicode 2.0 specification (see
      [TN1150]).

   o  Windows treats filenames as opaque character sequences (see
      [MSFNAM]) and will store and return the encoded bytes exactly as
      provided.

   o  Linux and other common Unix systems are generally similar to
      Windows in storing and returning opaque byte streams but this
      behavior is technically filesystem-dependent.

   o  Utilities used for file management, transfer, and archival may
      ignore this issue, apply an arbitrary normalization form, or allow
      the user to control how normalization is applied.

   In practice, this means that the encoded filename stored in a
   manifest may fail a simple file existence check because the
   filename's normalization was changed at some point after the manifest
   was written.  This situation is very confusing for users because the
   filenames are visually indistinguishable and the "missing" file is
   obviously present in the payload directory.

6.1.1.3.  Recommendations

   o  Implementations SHOULD discourage the creation of bags containing
      files which differ only in case.

   o  Implementations SHOULD prevent the creation of bags containing
      files which differ only in normalization form.

   o  BagIt implementations SHOULD tolerate differences in normalization
      form by comparing the list of filesystem names with the list of
      manifest names after applying the same normalization form to
      both.

   o  Implementations SHOULD issue a warning when multiple manifests are
      present which differ only in case or normalization form.

6.1.2.  Windows and Unix file naming

   As specified above, only the Unix-based path separator ('/') may be
   used inside filenames listed in BagIt manifest and fetch.txt files.
   When bags are exchanged between Windows and Unix platforms, the path
   separator SHOULD be translated as needed.  Receivers of bags on
   physical media SHOULD be prepared for filesystems created under
   either Windows or Unix.
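   As a minimal sketch of the separator translation mentioned above,
   the "pure" path classes in Python's standard pathlib module can
   re-render a relative path in either convention without touching the
   filesystem.  The two function names are hypothetical, and the sketch
   assumes relative paths whose components do not themselves contain a
   backslash.

```python
from pathlib import PurePosixPath, PureWindowsPath


def manifest_to_windows(manifest_path):
    r"""Render a '/'-separated manifest path with '\' separators."""
    # Split on '/' using POSIX rules, then re-join using Windows rules.
    return str(PureWindowsPath(*PurePosixPath(manifest_path).parts))


def windows_to_manifest(windows_path):
    r"""Render a '\'-separated relative path with '/' separators."""
    return PureWindowsPath(windows_path).as_posix()
```

   For example, manifest_to_windows("data/images/23364a.tif") yields
   the same path with backslashes, and windows_to_manifest() reverses
   the translation when a manifest is written on Windows.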
   Besides the fundamental difference between path separators ('\' and
   '/'), Windows filesystems generally have more limitations than Unix
   filesystems.

   Windows path names have a maximum of 255 characters, and none of
   these characters may be used in a path component:

      < > : " / | ? *

   Windows also reserves the following names, with or without a file
   extension:

      CON, PRN, AUX, NUL
      COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
      LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9

   See [MSFNAM] for more information and possible alternatives.

6.1.3.  Legacy checksum tools

   Some bags have been manually assembled using checksum utilities such
   as those contained in the GNU Coreutils package (md5sum, sha1sum,
   etc.), collectively referred to here as "md5sum".  Implementers who
   desire wide support of legacy content should be aware of some known
   quirks of these tools:

   md5sum can be run in "text mode", which causes it to normalize line
   endings on some operating systems.  On Unix-like systems both modes
   will usually produce the same results but on systems like Windows
   they can produce different results based on the file contents.  The
   md5sum output format has two characters between the checksum and the
   filepath: the first is always a space and the second is an asterisk
   ("*") for binary mode and a space for text mode.

   A final note about md5sum-generated manifests is that for a
   _filepath_ containing a backslash ('\'), the manifest line will have
   a backslash inserted in front of the _checksum_ and, under Windows,
   the backslashes inside _filepath_ can be doubled.

   Implementers MAY wish to accept this format by ignoring a leading
   asterisk or handling differences in line termination gracefully but,
   if so, implementations MUST warn the user that the bag in question
   will fail strict validation.
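   A tolerant reader along these lines might be sketched as follows.
   The function name and the warning strings are hypothetical, and the
   sketch only detects the md5sum markers described above; it does not
   attempt to undo any backslash doubling inside the filepath.

```python
import re

# Optional leading "\" (md5sum's escaped-filename marker), the checksum,
# whitespace, an optional "*" (md5sum binary mode), then the filepath.
_MANIFEST_LINE = re.compile(r"^(\\?)([0-9A-Fa-f]+)[ \t]+(\*?)(.+)$")


def parse_manifest_line(line):
    """Return (checksum, filepath, warnings) for one manifest line."""
    # Accept CR, LF, or CRLF line termination.
    match = _MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a manifest line: %r" % line)
    escaped, checksum, asterisk, filepath = match.groups()
    warnings = []
    if escaped:
        warnings.append("leading backslash (md5sum escaped filename)")
    if asterisk:
        warnings.append("leading asterisk (md5sum binary mode)")
    return checksum, filepath, warnings
```

   A caller would show any returned warnings to the user, since a bag
   whose manifest needs this leniency will not pass strict validation.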
   In such cases it is RECOMMENDED that tools provide an easy option to
   update the bag with valid manifests.

7.  Augmented Backus-Naur Form (non-normative)

   The Augmented Backus-Naur Form (ABNF) rules provided below are
   non-normative.  If there is a discrepancy between requirements in the
   normative sections and the ABNF, the requirements in the normative
   sections prevail.  Some definitions use the core rules (e.g., DIGIT,
   HEXDIG) as defined in [RFC4234].

7.1.  Bag Declaration: bagit.txt

   bagit.txt ABNF rules:

   bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending
               "Tag-File-Character-Encoding: " encoding ending
   encoding  = 1*CHAR
   ending    = CR / LF / CRLF

7.2.  Payload Manifest: manifest-algorithm.txt

   Payload Manifest ABNF rules:

   payload-manifest      = 1*payload-manifest-line
   payload-manifest-line = checksum 1*WSP filepath ending
   checksum              = 1*case-hexdig
   case-hexdig           = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" /
                           "D" / "d" / "E" / "e" / "F" / "f"
   filepath              = "data/"
                           1*( unreserved / pct-encoded / sub-delims )
   unreserved            = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims            = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
                           "*" / "+" / "," / ";" / "=" / "/"
   pct-encoded           = "%0D" / "%0d" / "%0A" / "%0a" / "%25"
   ending                = CR / LF / CRLF

7.3.  Bag Metadata: bag-info.txt

   bag-info.txt ABNF rules:

   metadata      = 1*metadata-line
   metadata-line = key ":" WSP value ending *(continuation ending)
   key           = 1*non-reserved
   value         = 1*non-reserved
   continuation  = WSP 1*non-reserved
   non-reserved  = VCHAR / WSP
                   ; any valid character for the specific encoding
                   ; except those that match "ending"
   ending        = CR / LF / CRLF

7.4.  Fetch File: fetch.txt

   fetch.txt ABNF rules:

   fetch      = 1*fetch-line
   fetch-line = url 1*WSP length 1*WSP filepath ending
   url        =
   length     = 1*DIGIT / "-"
   filepath   = ("data/"
                 1*( unreserved / pct-encoded / sub-delims ))
   ending     = CR / LF / CRLF

8.  Contributors

   Additional contributors to the authoring of BagIt are Andy Boyko,
   David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate
   Zwaard.

9.  Acknowledgements

   BagIt benefitted from the thoughtful assistance of Stephen Abrams,
   Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad
   Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark
   Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam
   Turoff, and Jim Tuttle.

10.  IANA Considerations

   This draft does not request any action from IANA.

11.  References

11.1.  Normative References

   [ni-registry]
              IANA, "Named Information Hash Algorithm Registry", 9 2016.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              DOI 10.17487/RFC1321, April 1992.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2978]  Freed, N. and J. Postel, "IANA Charset Registration
              Procedures", BCP 19, RFC 2978, DOI 10.17487/RFC2978,
              October 2000.

   [RFC3174]  Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC6234]  Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
              (SHA and SHA-based HMAC and HKDF)", RFC 6234,
              DOI 10.17487/RFC6234, May 2011.

   [RFC6920]  Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B.,
              Keranen, A., and P. Hallam-Baker, "Naming Things with
              Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

11.2.  Informative References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [LC-CONFORMANCE-SUITE]
              The Library of Congress, "BagIt Conformance Suite", 2016-.

   [MSFNAM]   Microsoft, Inc., "Naming a File", 2008.

   [RFC4234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, DOI 10.17487/RFC4234,
              October 2005.

   [TN1150]   Apple Inc., "Technical Note TN1150: HFS Plus Volume
              Format", 3 2004.

   [UNICODE-TR15]
              Unicode Consortium, "Unicode(R) Standard Annex #15:
              Unicode Normalization Forms", 2 2016.

Authors' Addresses

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   US

   Email: jak@ucop.edu

   Justin Littman
   George Washington University Libraries
   2130 H St NW
   Washington, DC 20052
   USA

   Email: justinlittman@gwu.edu

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: emad@loc.gov

   John Scancella
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: jsca@loc.gov

   Chris Adams
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: cadams@loc.gov