Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Intended status: Informational                                J. Littman
Expires: April 24, 2017           George Washington University Libraries
                                                               L. Madden
                                                     Library of Congress
                                                              E. Summers
                                                  University of Maryland
                                                                A. Boyko
                                                               B. Vargas
                                                        October 21, 2016

                The BagIt File Packaging Format (V0.97)
                          draft-kunze-bagit-14

Abstract

   This document specifies BagIt, a hierarchical file packaging format
   for storage and transfer of arbitrary digital content.  A "bag" has
   just enough structure to enclose descriptive "tags" and a "payload"
   but does not require knowledge of the payload's internal semantics.
   This BagIt format should be suitable for disk-based or network-based
   storage and transfer.  BagIt is widely used in the practice of
   digital preservation.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 24, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Purpose
     1.2.  Requirements
     1.3.  Terminology
   2.  Structure
     2.1.  Required Elements
       2.1.1.  Bag Declaration: bagit.txt
       2.1.2.  Payload Directory: data/
       2.1.3.  Payload Manifest: manifest-<algorithm>.txt
     2.2.  Optional Elements
       2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt
       2.2.2.  Bag Metadata: bag-info.txt
       2.2.3.  Fetch File: fetch.txt
       2.2.4.  Other Tag Files
     2.3.  Text Tag File Format
     2.4.  Bag Checksum Algorithms
   3.  Complete, Incomplete, and Valid bags
   4.  Serialization
   5.  Examples
     5.1.  Example of a basic bag
     5.2.  Another example bag
   6.  Security Considerations
     6.1.  Special directory characters
     6.2.  Control of URLs in fetch.txt
     6.3.  File sizes in fetch.txt
   7.  Practical Considerations (non-normative)
     7.1.  Disk and network transfer
     7.2.  Interoperability
       7.2.1.  Checksum tools
       7.2.2.  Windows and Unix file naming
   8.  Acknowledgements
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

1.1.  Purpose

   BagIt is a hierarchical file packaging format designed to support
   disk-based or network-based storage and transfer of arbitrary digital
   content.  A bag consists of a "payload" and "tags".  The content of
   the payload is the custodial focus of the bag and is treated as
   semantically opaque.  The "tags" are metadata files intended to
   facilitate and document the storage and transfer of the bag.  The
   name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".

   BagIt is widely used for preserving digital assets originating from
   different domains.  Organizations involved in digital preservation
   with BagIt include the Library of Congress, Dryad Data Repository,
   NSF DataONE, and the Rockefeller Archive Center.  Software
   implementations have been written in Python, Ruby, Java, Perl, and
   PHP.  It is also used in the libraries of many universities, such as
   Cornell, Purdue, Stanford, Ghent University, New York University, and
   the University of California.

   Implementors of BagIt tools should consider interoperability between
   different platforms, operating systems, toolsets, and languages.
   Differences in path separators, newline characters, reserved file
   names, and maximum path lengths are all possible barriers to moving
   bags between different systems.  These issues are discussed in
   Section 7.2 (Interoperability).

1.2.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

1.3.  Terminology

   This specification uses a number of terms to describe BagIt, some of
   which are in common use, some of which are newly defined by this
   specification, and others which may have meanings obvious only to
   those in the community from which this spec arose.  Terms defined in
   this section are intended to clarify any ambiguity.

   bag  A set of opaque data contained within the structure defined by
      this specification.

   bag declaration  The tag file required to be in all bags conforming
      to this specification.  Contains tags necessary for bootstrapping
      the reading and processing of the rest of a bag.  See
      Section 2.1.1.

   bag checksum algorithm  A reference to a cryptographic checksum
      algorithm, such as SHA1 or SHA256, with its name normalized for
      use in a manifest or tag manifest file name.  See Section 2.4.

   complete  A bag which comprises all elements required by this
      specification, with all files listed in all payload and tag
      manifests present, and with every payload file listed in at least
      one payload manifest.  See Section 3.

   payload  The data encapsulated by the bag.  The contents of the
      payload are opaque to this specification, and are always
      considered as a set of octet streams.  See Section 2.1.2.

   serialized bag  A bag that has been serialized into a single,
      monolithic file.  See Section 4.

   tag directory  A directory that contains one or more tag files.

   tag file  A file that contains metadata intended to facilitate and
      document the storage and transfer of the bag.

   valid  A complete bag wherein every checksum in every payload
      manifest and tag manifest can be successfully verified against the
      corresponding payload or tag file.  See Section 3.

2.  Structure

   A bag consists of a base directory containing (1) a set of required
   and optional tag files; (2) a sub-directory named "data", called the
   payload directory; and (3) a set of optional tag directories.  The
   payload files in the payload directory are an arbitrary file
   hierarchy (see Section 2.1.2).  The tag files in the base directory
   consist of one or more files named "manifest-_algorithm_.txt" (see
   Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
   zero or more additional tag files (see Section 2.2).  The tag files
   in the optional tag directories are arbitrary file hierarchies and
   the tag directories MAY have any name that is not reserved for a file
   or directory in this specification.

   The base directory MAY have any name.

   <base directory>/
   |   bagit.txt
   |   manifest-<algorithm>.txt
   |   [optional additional tag files]
   \--- data/
         |   [payload files]
   \--- [optional tag directories]/
         |   [optional tag files]

2.1.  Required Elements

2.1.1.  Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines:

      BagIt-Version: M.N
      Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag
   files.  The bag declaration MUST be encoded in UTF-8 [RFC3629], and
   MUST NOT contain a byte-order mark (BOM).

   The appropriate version for a bag that conforms to this version of
   the specification is "0.97".

2.1.2.  Payload Directory: data/

   The base directory MUST contain a sub-directory named "data", called
   the payload directory.

   The payload directory contains the custodial content within the bag.
   The files under the payload directory are called payload files, or
   the payload.  The payload is treated as octet streams for all
   purposes relating to this specification, and is not otherwise
   prescribed.

2.1.3.  Payload Manifest: manifest-<algorithm>.txt

   A payload manifest is a tag file that lists payload files and
   checksums for those payload files generated using a particular bag
   checksum algorithm.  Every bag MUST contain at least one payload
   manifest file.  A payload manifest file MUST have a name of the form
   "manifest-_algorithm_.txt", where _algorithm_ is a string specifying
   the bag checksum algorithm used in that manifest, such as:

      manifest-sha256.txt
      manifest-sha1.txt

   A bag MUST NOT contain more than one payload manifest for a
   particular bag checksum algorithm.

   Each line of a payload manifest file MUST be of the form:

      CHECKSUM FILENAME

   where FILENAME is the pathname of a file relative to the base
   directory and CHECKSUM is a hex-encoded checksum calculated according
   to _algorithm_ over every octet in the file.  The hex-encoded
   checksum MAY use uppercase and/or lowercase letters.  The slash
   character ('/') MUST be used as a path separator in FILENAME.  One or
   more linear whitespace characters (spaces or tabs) MUST separate
   CHECKSUM from FILENAME.  An asterisk ('*') MAY precede FILENAME for
   interoperability on some platforms (see Section 7.2.1).  There is no
   limitation on the length of a pathname.  The payload manifest MUST
   NOT reference files outside the payload directory.

   Payload manifests only include the pathnames of files.  Because of
   this, a payload manifest cannot reference empty directories.  To
   account for an empty directory, a bag creator may wish to include at
   least one file in that directory; it suffices, for example, to
   include a zero-length file named ".keep".

2.2.  Optional Elements

2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm.  A bag MAY contain one or more tag manifests.
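   The "CHECKSUM FILENAME" line format of Section 2.1.3 can be parsed
   along the following lines.  This is a minimal sketch, not part of the
   specification: the function names and the regular expression are
   hypothetical, and it assumes that a single asterisk directly before
   FILENAME is the optional marker and not part of the name.

```python
import re

# Assumed pattern: hex checksum, linear whitespace, optional '*', pathname.
_MANIFEST_LINE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+\*?(.+)$")


def parse_manifest_line(line):
    """Return (checksum, filename) for one manifest line.

    The checksum is lowercased so upper- and lowercase hex compare equal.
    """
    match = _MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a CHECKSUM FILENAME line: %r" % line)
    checksum, filename = match.groups()
    return checksum.lower(), filename


def parse_payload_manifest(text):
    """Map each payload pathname to its checksum.

    Entries that do not begin with "data/" are rejected, since a payload
    manifest must not reference files outside the payload directory.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # sketch-level leniency: skip blank lines
        checksum, filename = parse_manifest_line(line)
        if not filename.startswith("data/"):
            raise ValueError("entry outside data/: %r" % filename)
        entries[filename] = checksum
    return entries
```

   A caller would typically read a manifest file using the encoding
   named in "bagit.txt" and pass its text to parse_payload_manifest().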
   A tag manifest file MUST have a name of the form
   "tagmanifest-_algorithm_.txt", where _algorithm_ is a string
   specifying the bag checksum algorithm used in that manifest, such as:

      tagmanifest-sha256.txt
      tagmanifest-sha1.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.  As
   a result, no FILENAME listed in a tag manifest begins "data/".

2.2.2.  Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload.  The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human
   readability.  All metadata elements are optional and MAY be repeated.
   Implementations SHOULD assume that the ordering is significant and
   provide access to the metadata elements in the order they are given
   in the "bag-info.txt" file.

   A metadata element MUST consist of a label, a colon, and a value,
   each separated by optional whitespace.  The label MUST start in
   column 1.  It is RECOMMENDED that lines not exceed 79 characters in
   length.  Long values may be continued onto the next line by inserting
   a newline (LF), a carriage return (CR), or carriage return plus
   newline (CRLF) and indenting the next line with linear white space
   (spaces or tabs).

   Reserved metadata element names are case-insensitive and defined as
   follows.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      delivery.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.  Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.

   Payload-Oxum  The "octetstream sum" of the payload, namely, a two-
      part number of the form "OctetCount.StreamCount", where OctetCount
      is the total number of octets (8-bit bytes) across all payload
      file content and StreamCount is the total number of payload files.
      Payload-Oxum should be included in "bag-info.txt" if at all
      possible.  Compared to Bag-Size (above), Payload-Oxum is intended
      for machine consumption.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs.  If this identifier is
      recognizable as belonging to a globally unique scheme, the
      receiver should make an effort to honor reference to it.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local prose description of the
      contents of the bag.

   In addition to these metadata elements, other arbitrary metadata
   elements may also be present.

   Here is an example "bag-info.txt" file.

   Source-Organization: Spengler University
   Organization-Address: 1400 Elm St., Cupertino, California, 95014
   Contact-Name: Edna Janssen
   Contact-Phone: +1 408-555-1212
   Contact-Email: ej@spengler.edu
   External-Description: Uncompressed greyscale TIFF images from the
        Yoshimuri papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: spengler_yoshimuri_001
   Bag-Size: 260 GB
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: spengler_yoshimuri
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/yoshimuri
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3.  Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  An OPTIONAL tag file named "fetch.txt"
   contains such a list.  Each line of "fetch.txt" has the form

      URL LENGTH FILENAME

   where URL [RFC3986] identifies the file to be fetched, LENGTH is the
   number of octets in the file (or "-", to leave it unspecified), and
   FILENAME identifies the corresponding payload file, relative to the
   base directory.  The slash character ('/') MUST be used as a path
   separator in FILENAME.  If FILENAME begins with a slash character,
   the destination MUST still be treated as relative to the bag base
   directory.  One or more linear whitespace characters (spaces or tabs)
   MUST separate these three values, and any such characters in the URL
   MUST be percent-encoded.  There is no limitation on the length of any
   of the fields in the "fetch.txt".

   The "fetch.txt" file allows a bag to be transmitted with "holes" in
   it, which can be practical for several reasons.  For example, it
   obviates the need for the sender to stage a large serialized copy of
   the content while the bag is transferred to the receiver.
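   The "URL LENGTH FILENAME" line format described above can be read
   with a sketch such as the following.  The names FetchEntry and
   parse_fetch_line are hypothetical, and the refusal of ".." path
   components is an extra precaution assumed by this sketch rather than
   a rule stated in this section.

```python
import re
from typing import NamedTuple, Optional

# URL (no whitespace), then LENGTH (digits or "-"), then FILENAME.
_FETCH_LINE = re.compile(r"^(\S+)[ \t]+([0-9]+|-)[ \t]+(.+)$")


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the line gives "-"
    filename: str          # always relative to the bag base directory


def parse_fetch_line(line):
    """Split one fetch.txt line into its three fields."""
    match = _FETCH_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a URL LENGTH FILENAME line: %r" % line)
    url, length, filename = match.groups()
    # A leading slash does not make the destination absolute.
    filename = filename.lstrip("/")
    # Assumption of this sketch: reject paths that climb out of the bag.
    if ".." in filename.split("/"):
        raise ValueError("path escapes the bag: %r" % filename)
    return FetchEntry(url, None if length == "-" else int(length), filename)
```

   Because whitespace inside the URL is percent-encoded, the first two
   fields never contain spaces, so everything after LENGTH can be taken
   as FILENAME even when the file name itself contains spaces.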
   This method also allows a sender to construct a bag from components
   that are either a subset of logically related components (e.g., the
   localized logical object could be much larger than what is intended
   for export) or assembled from logically distributed sources (e.g.,
   the object components for export are not stored locally under one
   filesystem tree).

2.2.4.  Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   specification.  Implementations SHOULD ignore the content of any
   unexpected tag files, except when they are listed in a tag manifest.
   When unexpected tag files are listed in a tag manifest,
   implementations MUST treat the content of those tag files only as
   octet streams for the purpose of checksum verification.

2.3.  Text Tag File Format

   All tag files specifically described in this specification MUST
   adhere to the text tag file format described below.  Other tag files
   MAY adhere to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a newline (LF), a carriage return (CR), or carriage return plus
   newline (CRLF).  Text tag files MUST end in the extension ".txt".

   In all text tag files except for the bag declaration file, text MUST
   be encoded in the character encoding specified in the "bagit.txt" bag
   declaration file.  Text tag files except for the bag declaration file
   MAY include a byte-order mark (BOM) only if the specified encoding
   requires it for proper decoding.  (Note that UTF-8 does not.)  As
   specified in Section 2.1.1, the bag declaration file must be encoded
   in UTF-8 and must not include a byte-order mark.

2.4.  Bag Checksum Algorithms

   The payload manifest and tag manifests assert integrity of the
   payload and tags in a bag using checksum algorithms.
   The operation of those algorithms, and the formatting of their output
   within a manifest file, are generally beyond the scope of this
   specification, except that the output format MUST be able to fit in
   the manifest format specified in Section 2.1.3.

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.

   Implementors of tools that create and validate bags SHOULD support at
   least two widely implemented checksum algorithms: "md5" [RFC1321] and
   "sha1" [RFC3174].  The authors recognize that, compared with newer
   algorithms [RFC6234], these two algorithms now have well-known
   vulnerabilities that render them inadequate for applications
   requiring secure change detection.

3.  Complete, Incomplete, and Valid bags

   A _complete_ bag MUST have the following attributes:

   1.  Every required element MUST be present (Section 2.1).

   2.  Every file in every payload manifest MUST be present.

   3.  Every file in every tag manifest MUST be present.  Tag files not
       listed in a tag manifest MAY be present.

   4.  Every payload file MUST be listed in at least one manifest.
       Payload files MAY be listed in more than one payload manifest.

   5.  Every element present MUST comply with this specification.

   A bag is _incomplete_ when it exhibits any of the following
   exceptions to the attributes of a complete bag:

   1.  One or more files in any payload manifest are absent.

   2.  One or more files in any tag manifest are absent.

   3.  A fetch.txt is present.  Any files listed in any payload manifest
       or any tag manifest which are absent MUST be listed in the
       fetch.txt.

   A _valid_ bag must have the following attributes:

   1.  The bag MUST be complete.

   2.  Every CHECKSUM in every payload manifest and tag manifest can be
       successfully verified against the contents of its corresponding
       FILENAME.

   If a bag is neither valid, complete, nor incomplete, it is _invalid_.
   Definitions for the various ways a bag may be invalid are not covered
   by this specification.

   Tag files that do not appear in a tag manifest can be modified, added
   to, or removed from a bag without impacting the completeness or
   validity of the bag.

4.  Serialization

   In some scenarios, it might be convenient to serialize the bag's
   filesystem hierarchy (i.e., the base directory) into a single-file
   archive format such as TAR or ZIP (the serialization) and then later
   deserialize the serialization to recreate the filesystem hierarchy.
   Several rules govern the serialization of a bag and apply equally to
   all types of archive files:

   1.  The top-level directory of a serialization MUST contain only one
       bag.

   2.  The serialization SHOULD have the same name as the bag's base
       directory, but MUST have an extension added to identify the
       format.  For example, the receiver of "mybag.tar.gz" expects the
       corresponding base directory to be created as "mybag".

   3.  A bag MUST NOT be serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" and,
       beneath "mybag", the creation of all payload and tag files.

   4.  The deserialization of a bag MUST produce a single base directory
       bag with the top-level structure as described in this
       specification without requiring any additional un-archiving step.
       For example, after one un-archiving step it would be an error for
       the "data/" directory to appear as "data.tar.gz".  TAR and ZIP
       files may appear inside the payload beneath the "data/"
       directory, where they would be treated as any other payload file.

   When serializing a bag, care must be taken to ensure that the archive
   format's restrictions on file naming, such as allowable characters,
   length, or character encoding, will support the requirements of the
   systems on which it will be used.  See Section 7.2.

5.  Examples

5.1.  Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file.  Lines of file content are shown in parentheses beneath the
   file name.  For brevity, examples use the md5 checksum algorithm.

   myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-version: 0.96                                          )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
         |
         |   27613-h/images/q172.png
         |    (... image bytes ...                                    )
         |
         |   27613-h/images/q172.txt
         |    (... OCR text ...                                       )
         ....

5.2.  Another example bag

   The following example bag contains content from a web crawler.  As
   before, lines of file content are shown in parentheses beneath the
   file name, with long lines continued indented on subsequent lines.
   This bag is not complete until every component listed in the
   "fetch.txt" file is retrieved.

   mysecondbag/
   |
   |   manifest-md5.txt
   |    (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt )
   |    (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt           )
   |    (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
   |    (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
   |
   |   fetch.txt
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz
   |       26583985 data/gov-20060601-050019.arc.gz                    )
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz
   |       99509720 data/gov-20060601-100002.arc.gz                    )
   |    ( ...............................................................)
   |
   |   bag-info.txt
   |    (Source-organization: California Digital Library               )
   |    (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612)
   |    (Contact-name: A. E. Newman                                    )
   |    (Contact-phone: +1 510-555-1234                                )
   |    (Contact-email: alfred@ucop.edu                                )
   |    (External-Description: The collection "Local Davis Flood Control )
   |       Collection" includes captured California State and local    )
   |       websites containing information on flood control resources for )
   |       the Davis and Sacramento area.  Sites were captured by UC Davis)
   |       curator Wrigley Spyder using the Web Archiving Service in   )
   |       February 2007 and October 2007.                             )
   |    (Bag-date: 2008.04.15                                          )
   |    (External-identifier: ark:/13030/fk4jm2bcp                     )
   |    (Bag-size: about 22Gb                                          )
   |    (Payload-Oxum: 21836794142.831                                 )
   |    (Internal-sender-identifier: UCDL                              )
   |    (Internal-sender-description: UC Davis Libraries               )
   |
   |   bagit.txt
   |    (BagIt-version: 0.96                                           )
   |    (Tag-File-Character-Encoding: UTF-8                            )
   |
   \--- data/
         |
         |   Collection Overview.txt
         |    (... narrative description ...                           )
         |
         |   Seed List.txt
         |    (... list of crawler starting point URLs ...             )
         ....

6.  Security Considerations

6.1.  Special directory characters

   The paths specified in the payload manifest, tag manifest, and
   "fetch.txt" file do not prohibit special directory characters which
   might be significant on implementing systems.  Implementors SHOULD
   take care that files outside the bag directory structure are not
   accessed when reading or writing files based on paths specified in a
   bag.

   For example, path characters such as ".." or "~" in a maliciously
   crafted "fetch.txt" file might cause a naive implementation to
   overwrite critical system files.

6.2.  Control of URLs in fetch.txt

   Implementors of tools that complete bags by retrieving URLs listed in
   a "fetch.txt" file need to be aware that some of those URLs may point
   to hosts, intentionally or unintentionally, that are not under
   control of the bag's sender.  Checksums are intended as a reasonable
   guarantee against corruption during transit, not a strong
   cryptographic protection against intentional spoofing.

6.3.  File sizes in fetch.txt

   The size of files, as optionally reported in the "fetch.txt" file,
   cannot be guaranteed to match the actual file size to be downloaded.
   Implementors SHOULD take care to appropriately handle cases where the
   actual file size does not match the file size reported in the
   "fetch.txt" file.  Implementors SHOULD NOT use the file size in the
   "fetch.txt" file for critical resource allocation, such as buffer
   sizing or storage requisitioning.

7.  Practical Considerations (non-normative)

7.1.  Disk and network transfer

   When creating a bag on physical media (such as hard disk, CD-ROM, or
   DVD) for transfer to another organization, the sender should select
   and format the media in a manner compatible with both the content
   requirements (e.g., file names and sizes) and the receiver's
   technical infrastructure.
   If the receiver's infrastructure is not known or the media needs to
   be compatible with a range of potential receivers, consideration
   should be given to portability and common usage.  For example, a
   "lowest common denominator" for some potential receivers could be USB
   disk drives formatted with the FAT32 filesystem.

   Although overall bag size is unlimited in principle, network-based
   transfers might involve constraints on the amount of bag data that a
   receiver can receive at one time.  It might be practical to split a
   large bag into several smaller bags.

   Transmitting a whole bag in serialized form as a single file will
   tend to be the most straightforward mode of transfer.  When
   throughput is a priority, use of "fetch.txt" lends itself to an easy,
   application-level parallelism in which the list of URL-addressed
   items to fetch is divided among multiple processes.  The mechanics of
   sending and receiving bags over networks are otherwise out of scope
   of the present document and might be facilitated by protocols such as
   [GRABIT] and [SWORD].

7.2.  Interoperability

   This section is not part of the BagIt specification.  It describes
   some practical considerations for bag creators and receivers circa
   2010.

7.2.1.  Checksum tools

   Some cautions regarding bag interchange arise in regard to the
   commonly available checksum tools distributed with the GNU Coreutils
   package (md5sum, sha1sum, sha256sum, etc.), collectively referred to
   here as "sha256sum".  First, sha256sum can be run in binary or text
   mode; text mode sometimes normalizes line-endings.  While these modes
   appear to produce the same checksums under Unix-like systems, they
   can produce different checksums under Windows.
When using sha256sum,
it might be safest to run it in binary mode, with one caveat: a side
effect of binary mode is that sha256sum requires a space and an
asterisk ('*'), compared to two spaces in text mode, between the
CHECKSUM and FILENAME in its manifest format.

Due to the widespread use of sha256sum (and its relatives), it is not
unexpected for bag receivers to see manifests in which CHECKSUM and
FILENAME are separated by a space followed by an asterisk.
Implementors creating or processing bags with sha256sum should be
aware of these subtle differences, and ensure compliance with the
manifest specification in this document. Implementors creating and
processing bags with other tools might wish to be tolerant of
asterisks found in the manifests.

A final note about sha256sum-generated manifests is that for a
FILENAME containing a backslash ('\'), the manifest line will have a
backslash inserted in front of the CHECKSUM and, under Windows, the
backslashes inside FILENAME might be doubled.

7.2.2. Windows and Unix file naming

As specified above, only the Unix-based path separator ('/') may be
used inside filenames listed in BagIt manifests and "fetch.txt"
files. When bags are exchanged between Windows and Unix platforms,
care should be taken to translate the path separator as needed.
Receivers of bags on physical media should be prepared for
filesystems created under either Windows or Unix. Besides the
fundamental difference between path separators ('\' and '/'),
generally, Windows filesystems have more limitations than Unix
filesystems. Windows path names have a maximum of 255 characters,
and none of these characters may be used in a path component:

   < > : " / | ? *

Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
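
The interoperability notes in Sections 7.2.1 and 7.2.2 can be
illustrated with a minimal Python sketch, which is not part of the
BagIt specification. It shows one way a receiver might tolerate the
sha256sum binary-mode asterisk and flag paths that may not be
portable to Windows. The names parse_manifest_line and
windows_problems are hypothetical. Matching reserved names
case-insensitively and ignoring any extension is an assumption that
goes beyond the text above, and backslash-prefixed manifest lines are
deliberately not handled.

```python
import re

# Characters Section 7.2.2 lists as not usable in a Windows path component.
WINDOWS_FORBIDDEN_CHARS = set('<>:"/|?*')

# Names Section 7.2.2 lists as reserved under Windows.
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

# CHECKSUM, whitespace, an optional binary-mode asterisk, then FILENAME.
# Trade-off: a FILENAME that really begins with '*' loses that character.
MANIFEST_LINE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+\*?(.+)$")


def parse_manifest_line(line):
    """Return (checksum, filename), tolerating a '*' before FILENAME."""
    match = MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        # Includes sha256sum's backslash-prefixed lines, not handled here.
        raise ValueError(f"not a recognized manifest line: {line!r}")
    return match.group(1), match.group(2)


def windows_problems(path):
    """List reasons a '/'-separated bag path may not be portable to Windows."""
    problems = []
    if len(path) > 255:
        problems.append("path longer than 255 characters")
    for component in path.split("/"):
        bad = sorted(WINDOWS_FORBIDDEN_CHARS.intersection(component))
        if bad:
            problems.append(f"{component!r} contains {''.join(bad)!r}")
        # Assumption: compare case-insensitively and ignore any extension,
        # so "nul.txt" is flagged as well as "NUL".
        if component.split(".")[0].upper() in WINDOWS_RESERVED_NAMES:
            problems.append(f"{component!r} is a reserved name")
    return problems
```

A receiver might call parse_manifest_line on each manifest line and
then pass the resulting filename to windows_problems before writing
files to a Windows filesystem. An empty list means none of the checks
above fired; it does not guarantee that the path is usable on a given
system.
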
See [MSFNAM] for more
information.

8. Acknowledgements

BagIt owes much to many thoughtful contributors and reviewers,
including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards,
Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David
Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim
Tuttle.

9. IANA Considerations

This draft does not request any action from IANA.

10. References

10.1. Normative References

[MSFNAM]  Microsoft, "Naming a File", 2008.

[RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
          DOI 10.17487/RFC1321, April 1992.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119,
          DOI 10.17487/RFC2119, March 1997.

[RFC3174] Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
          (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.

[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
          10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
          2003.

[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
          Resource Identifier (URI): Generic Syntax", STD 66,
          RFC 3986, DOI 10.17487/RFC3986, January 2005.

[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
          (SHA and SHA-based HMAC and HKDF)", RFC 6234,
          DOI 10.17487/RFC6234, May 2011.

10.2. Informative References

[ENCDEP]  Tabata, K., "A Collaboration Model between Archival
          Systems to Enhance the Reliability of Preservation by an
          Enclose-and-Deposit Method", 2005.

[GRABIT]  NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008.

[SWORD]   UKOLN/JISC CETIS, "Simple Web-service Offering
          Repository Deposit (SWORD)", 2008.

Authors' Addresses

John A. Kunze
California Digital Library
415 20th St, 4th Floor
Oakland, CA 94612
US

Email: jak@ucop.edu

Justin Littman
George Washington University Libraries
2130 H Street, NW
Washington, DC 20052
USA

Email: justinlittman@gmail.com

Liz Madden
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
USA

Email: emad@loc.gov

Ed Summers
University of Maryland
0301 Hornbake Library
College Park, MD 20742-7011
USA

Email: ehs@pobox.com

Andy Boyko
1538 Winding Way
Belmont, CA 94002
USA

Email: andrew@boyko.net

Brian Vargas
1354 Quincy St. NW
Washington, DC 20011
USA

Email: brian@ardvaark.net