Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Expires: July 29, 2016                                        J. Littman
                                            George Washington University
                                                               Libraries
                                                               L. Madden
                                                     Library of Congress
                                                              E. Summers
                                                  University of Maryland
                                                                A. Boyko
                                                               B. Vargas
                                                        January 26, 2016

                The BagIt File Packaging Format (V0.97)
                        draft-kunze-bagit-13.txt

Abstract

   This document specifies BagIt, a hierarchical file packaging format
   for storage and transfer of arbitrary digital content.  A "bag" has
   just enough structure to enclose descriptive "tags" and a "payload"
   but does not require knowledge of the payload's internal semantics.
   This BagIt format should be suitable for disk-based or network-based
   storage and transfer.  BagIt is widely used in the practice of
   digital preservation.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 29, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Purpose
     1.2.  Requirements
     1.3.  Terminology
   2.  Structure
     2.1.  Required Elements
       2.1.1.  Bag Declaration: bagit.txt
       2.1.2.  Payload Directory: data/
       2.1.3.  Payload Manifest: manifest-<algorithm>.txt
     2.2.  Optional Elements
       2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt
       2.2.2.  Bag Metadata: bag-info.txt
       2.2.3.  Fetch File: fetch.txt
       2.2.4.  Other Tag Files
     2.3.  Text Tag File Format
     2.4.  Bag Checksum Algorithms
   3.  Complete, Incomplete, and Valid bags
   4.  Serialization
   5.  Examples
     5.1.  Example of a basic bag
     5.2.  Another example bag
   6.  Security Considerations
     6.1.  Special directory characters
     6.2.  Control of URLs in fetch.txt
     6.3.  File sizes in fetch.txt
   7.  Practical Considerations (non-normative)
     7.1.  Disk and network transfer
     7.2.  Interoperability
       7.2.1.  Checksum tools
       7.2.2.  Windows and Unix file naming
   8.  Acknowledgements
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

1.1.  Purpose

   BagIt is a hierarchical file packaging format designed to support
   disk-based or network-based storage and transfer of arbitrary digital
   content.  A bag consists of a "payload" and "tags".  The content of
   the payload is the custodial focus of the bag and is treated as
   semantically opaque.  The "tags" are metadata files intended to
   facilitate and document the storage and transfer of the bag.  The
   name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".

   BagIt is widely used for preserving digital assets originating from
   many different domains.  Organizations involved in digital
   preservation with BagIt include the Library of Congress, Dryad Data
   Repository, NSF DataONE, and the Rockefeller Archive Center.
   Software implementations have been written in Python, Ruby, Java,
   Perl, and PHP.  BagIt is also used in the libraries of many
   universities, such as Cornell, Purdue, Stanford, Ghent University,
   New York University, and the University of California.

   Implementors of BagIt tools should consider interoperability between
   different platforms, operating systems, toolsets, and languages.
   Differences in path separators, newline characters, reserved file
   names, and maximum path lengths are all possible barriers to moving
   bags between different systems.  Discussion of these issues may be
   found in the Interoperability section of this document (Section 7.2).

1.2.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   An implementation is not compliant if it fails to satisfy one or more
   of the MUST or REQUIRED level requirements for the protocols it
   implements.  An implementation that satisfies all the MUST or
   REQUIRED level and all the SHOULD level requirements for its
   protocols is said to be "unconditionally compliant"; one that
   satisfies all the MUST level requirements but not all the SHOULD
   level requirements for its protocols is said to be "conditionally
   compliant."

1.3.  Terminology

   This specification uses a number of terms to describe BagIt, some of
   which are in common use, some of which are newly defined by this
   specification, and others which may have meanings obvious only to
   those in the community from which this specification arose.  The
   terms defined in this section are intended to remove any ambiguity.

   bag  A set of opaque data contained within the structure defined by
      this specification.

   bag declaration  The tag file required to be in all bags conforming
      to this specification.
      Contains tags necessary for bootstrapping the reading and
      processing of the rest of a bag.  See Section 2.1.1.

   bag checksum algorithm  A reference to a cryptographic checksum
      algorithm, such as MD5 or SHA-1, with its name normalized for use
      in a manifest or tag manifest file name.  See Section 2.4.

   complete  A bag which comprises all elements required by this
      specification, with all files listed in all payload and tag
      manifests present, and with every payload file listed in at least
      one manifest.  See Section 3.

   payload  The data encapsulated by the bag.  The contents of the
      payload are opaque to this specification, and are always
      considered as a set of octet streams.  See Section 2.1.2.

   serialized bag  A bag that has been serialized into a single,
      monolithic file.  See Section 4.

   tag directory  A directory that contains one or more tag files.

   tag file  A file that contains metadata intended to facilitate and
      document the storage and transfer of the bag.

   valid  A complete bag wherein every checksum in every payload
      manifest and tag manifest can be successfully verified against the
      corresponding file.  See Section 3.

2.  Structure

   A bag consists of a base directory containing (1) a set of required
   and optional tag files; (2) a sub-directory named "data", called the
   payload directory; and (3) a set of optional tag directories.  The
   payload files in the payload directory are an arbitrary file
   hierarchy (see Section 2.1.2).  The tag files in the base directory
   consist of one or more files named "manifest-_algorithm_.txt" (see
   Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
   zero or more additional tag files (see Section 2.2).  The tag files
   in the optional tag directories are arbitrary file hierarchies and
   the tag directories MAY have any name that is not reserved for a file
   or directory in this specification.
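
   As a non-normative illustration of the structure just described, the
   following Python sketch reports which required elements are absent
   from a directory.  The function name and the returned labels are
   hypothetical, and the file name pattern assumes the lowercase
   alphanumeric algorithm names of Section 2.4.

```python
import re
from pathlib import Path

# Assumption: payload manifests are named manifest-<algorithm>.txt, where
# the algorithm name is lowercase alphanumeric (see Section 2.4).
MANIFEST_RE = re.compile(r"^manifest-([a-z0-9]+)\.txt$")


def missing_required_elements(base_dir):
    """Return labels for the required bag elements absent from base_dir."""
    base = Path(base_dir)
    missing = []

    # The bag declaration must be a file directly in the base directory.
    if not (base / "bagit.txt").is_file():
        missing.append("bagit.txt")

    # The payload directory must be a sub-directory named "data".
    if not (base / "data").is_dir():
        missing.append("data/")

    # At least one payload manifest must be present.
    has_manifest = any(
        entry.is_file() and MANIFEST_RE.match(entry.name)
        for entry in base.iterdir()
    )
    if not has_manifest:
        missing.append("manifest-<algorithm>.txt")

    return missing
```

   An empty list means only that the required elements exist; it says
   nothing about completeness or validity (Section 3).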

   The base directory MAY have any name.

   <base directory>/
   |   bagit.txt
   |   manifest-<algorithm>.txt
   |   [optional additional tag files]
   \--- data/
         |   [payload files]
   \--- [optional tag directories]/
         |   [optional tag files]

2.1.  Required Elements

2.1.1.  Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines:

   BagIt-Version: M.N
   Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag
   files.  The bag declaration MUST be encoded in UTF-8, and MUST NOT
   contain a byte-order mark (BOM).  [RFC3629]

   The appropriate version for a bag that conforms to this version of
   the specification is "0.97".

2.1.2.  Payload Directory: data/

   The base directory MUST contain a sub-directory named "data", called
   the payload directory.

   The payload directory contains the custodial content within the bag.

   The files under the payload directory are called payload files, or
   the payload.  The payload is treated as octet streams for all
   purposes relating to this specification, and is not otherwise
   prescribed.

2.1.3.  Payload Manifest: manifest-<algorithm>.txt

   A payload manifest is a tag file that lists payload files and
   checksums for those payload files generated using a particular bag
   checksum algorithm.  Every bag MUST contain one payload manifest
   file, and MAY contain more than one.  A payload manifest file MUST
   have a name of the form manifest-_algorithm_.txt, where _algorithm_
   is a string specifying the bag checksum algorithm used in that
   manifest, such as:

   manifest-md5.txt
   manifest-sha1.txt

   A bag MUST NOT contain more than one payload manifest for a
   particular bag checksum algorithm.
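
   The bag declaration rules of Section 2.1.1 are simple enough to check
   mechanically.  The Python sketch below is one possible reading, not a
   reference implementation: the function name is hypothetical, and it
   assumes a single space after each colon, which the two-line form above
   shows but does not explicitly mandate.

```python
import codecs
import re


def parse_bag_declaration(raw):
    """Parse the bytes of bagit.txt into (major, minor, encoding).

    Raises ValueError when the declaration breaks a rule of Section 2.1.1.
    """
    # The bag declaration must not contain a byte-order mark.
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError("bagit.txt must not start with a byte-order mark")

    # The bag declaration must be UTF-8; a decoding failure raises
    # UnicodeDecodeError, which is a subclass of ValueError.
    text = raw.decode("utf-8")

    # Lines end with LF, CR, or CRLF; drop the empty piece after the
    # final line terminator, if there is one.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")

    version = re.fullmatch(r"BagIt-Version: (\d+)\.(\d+)", lines[0])
    encoding = re.fullmatch(r"Tag-File-Character-Encoding: (\S+)", lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")

    return int(version.group(1)), int(version.group(2)), encoding.group(1)
```

   The returned encoding is the one to use when reading the other text
   tag files (Section 2.3).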

   Each line of a payload manifest file MUST be of the form:

   CHECKSUM FILENAME

   where FILENAME is the pathname of a file relative to the base
   directory and CHECKSUM is a hex-encoded checksum calculated according
   to _algorithm_ over every octet in the file.  The hex-encoded
   checksum MAY use uppercase and/or lowercase letters.  The slash
   character ('/') MUST be used as a path separator in FILENAME.  One or
   more linear whitespace characters (spaces or tabs) MUST separate
   CHECKSUM from FILENAME.  An asterisk ('*') MAY precede FILENAME for
   interoperability on some platforms (see Section 7.2.1).  There is no
   limitation on the length of a pathname.  The payload manifest MUST
   NOT reference files outside the payload directory.

   Payload manifests only include the pathnames of files.  Because of
   this, a payload manifest cannot reference empty directories.  To
   account for an empty directory, a bag creator may wish to include at
   least one file in that directory; it suffices, for example, to
   include a zero-length file named ".keep".

2.2.  Optional Elements

2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm.  A bag MAY contain one or more tag manifests.  A tag
   manifest file MUST have a name of the form
   "tagmanifest-_algorithm_.txt", where _algorithm_ is a string
   specifying the bag checksum algorithm used in that manifest, such as:

   tagmanifest-md5.txt
   tagmanifest-sha1.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.  As
   a result, no FILENAME listed in a tag manifest begins with "data/".

2.2.2.  Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload.
   The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human
   readability.  All metadata elements are optional and MAY be repeated.
   Implementations SHOULD assume that the ordering is significant and
   provide access to the metadata elements in the order they are given
   in the "bag-info.txt" file.

   A metadata element MUST consist of a label, a colon, and a value,
   each separated by optional whitespace.  It is RECOMMENDED that lines
   not exceed 79 characters in length.  Long values may be continued
   onto the next line by inserting a newline (LF), a carriage return
   (CR), or carriage return plus newline (CRLF) and indenting the next
   line with linear white space (spaces or tabs).

   Reserved metadata element names are case-insensitive and defined as
   follows.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      delivery.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.  Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.

   Payload-Oxum  The "octetstream sum" of the payload, namely, a
      two-part number of the form "OctetCount.StreamCount", where
      OctetCount is the total number of octets (8-bit bytes) across all
      payload file content and StreamCount is the total number of
      payload files.  Payload-Oxum should be included in "bag-info.txt"
      if at all possible.  Compared to Bag-Size (above), Payload-Oxum is
      intended for machine consumption.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs.  This identifier must
      be unique across the sender's content, and if recognizable as
      belonging to a globally unique scheme, the receiver should make an
      effort to honor reference to it.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local prose description of the
      contents of the bag.

   In addition to these metadata elements, other arbitrary metadata
   elements may also be present.

   Here is an example "bag-info.txt" file.

   Source-Organization: Spengler University
   Organization-Address: 1400 Elm St., Cupertino, California, 95014
   Contact-Name: Edna Janssen
   Contact-Phone: +1 408-555-1212
   Contact-Email: ej@spengler.edu
   External-Description: Uncompressed greyscale TIFF images from the
        Yoshimuri papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: spengler_yoshimuri_001
   Bag-Size: 260 GB
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: spengler_yoshimuri
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/yoshimuri
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3.  Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  An OPTIONAL tag file named "fetch.txt"
   contains such a list.  Each line of "fetch.txt" has the form:

   URL LENGTH FILENAME

   where URL identifies the file to be fetched, LENGTH is the number of
   octets in the file (or "-", to leave it unspecified), and FILENAME
   identifies the corresponding payload file, relative to the base
   directory.  The slash character ('/') MUST be used as a path
   separator in FILENAME.  If FILENAME begins with a slash character,
   the destination MUST still be treated as relative to the bag base
   directory.  One or more linear whitespace characters (spaces or tabs)
   MUST separate these three values, and any such characters in the URL
   MUST be percent-encoded [RFC3986].  There is no limitation on the
   length of any of the fields in the "fetch.txt".

   The "fetch.txt" file allows a bag to be transmitted with "holes" in
   it, which can be practical for several reasons.  For example, it
   obviates the need for the sender to stage a large serialized copy of
   the content while the bag is transferred to the receiver.
   Also, this
   method allows a sender to construct a bag from components that are
   either a subset of logically related components (e.g., the localized
   logical object could be much larger than what is intended for export)
   or assembled from logically distributed sources (e.g., the object
   components for export are not stored locally under one filesystem
   tree).

2.2.4.  Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   specification.  Implementations SHOULD ignore the content of any
   unexpected tag files, except when they are listed in a tag manifest.
   When unexpected tag files are listed in a tag manifest,
   implementations MUST only treat the content of those tag files as
   octet streams for the purpose of checksum verification.

2.3.  Text Tag File Format

   All tag files specifically described in this specification MUST
   adhere to the text tag file format described below.  Other tag files
   MAY adhere to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a newline (LF), a carriage return (CR), or carriage return plus
   newline (CRLF).  Text tag files MUST end in the extension ".txt".

   In all text tag files except for the bag declaration file, text MUST
   be encoded in the character encoding specified in the "bagit.txt" bag
   declaration file.  Text tag files except for the bag declaration file
   MAY include a byte-order mark (BOM) only if the specified encoding
   requires it for proper decoding.  (Note that UTF-8 does not.)

   As specified in Section 2.1.1, the bag declaration file must be
   encoded in UTF-8 and must not include a byte-order mark.

2.4.  Bag Checksum Algorithms

   The payload manifest and tag manifests assert integrity of the
   payload and tags in a bag using checksum algorithms.
   The operation
   of those algorithms, and the formatting of their output within a
   manifest file, are generally beyond the scope of this specification,
   except that the output format MUST be able to fit in the manifest
   format specified in Section 2.1.3.

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.

   Implementors of tools that create and validate bags SHOULD support at
   least two widely implemented checksum algorithms: "md5" [RFC1321] and
   "sha1" [RFC3174].  The authors recognize that these two algorithms
   now have well-known vulnerabilities that render them inadequate for
   applications requiring secure change detection.

3.  Complete, Incomplete, and Valid bags

   A _complete_ bag MUST have the following attributes:

   1.  Every required element MUST be present (Section 2.1).

   2.  Every file in every payload manifest MUST be present.

   3.  Every file in every tag manifest MUST be present.  Tag files not
       listed in a tag manifest MAY be present.

   4.  Every payload file MUST be listed in at least one manifest.
       Payload files MAY be listed in more than one payload manifest.

   5.  Every element present MUST comply with this specification.

   A bag is _incomplete_ when it exhibits any of the following
   exceptions to the attributes of a complete bag:

   1.  One or more files in any payload manifest are absent.

   2.  One or more files in any tag manifest are absent.

   3.  A fetch.txt is present.  Any files listed in any payload manifest
       or any tag manifest which are absent MUST be listed in the
       fetch.txt.

   A _valid_ bag must have the following attributes:

   1.  The bag MUST be complete.

   2.  Every CHECKSUM in every payload manifest and tag manifest can be
       successfully verified against the contents of its corresponding
       FILENAME.
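
   The checksum condition for a valid bag, together with the name
   normalization rule of Section 2.4, can be sketched in Python as
   follows.  This is an illustration under stated assumptions rather than
   a definitive validator: all three function names are hypothetical,
   the manifest is assumed to be UTF-8 encoded instead of being read
   with the encoding declared in "bagit.txt", and the path checks of
   Section 6.1 are omitted.

```python
import hashlib
import re
from pathlib import Path


def normalize_algorithm(common_name):
    """Lowercase the common name and remove non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def parse_manifest(text):
    """Yield (checksum, filename) pairs from the text of a manifest."""
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue
        # One or more spaces or tabs separate CHECKSUM from FILENAME;
        # FILENAME itself may contain spaces.
        checksum, filename = re.split(r"[ \t]+", line, maxsplit=1)
        # Tolerate the single asterisk that may precede FILENAME.
        if filename.startswith("*"):
            filename = filename[1:]
        # Hex digits may be upper or lower case, so compare in lowercase.
        yield checksum.lower(), filename


def failed_checksums(base_dir, manifest_name):
    """Return the FILENAMEs that are absent or whose checksum differs."""
    base = Path(base_dir)
    # The algorithm is taken from the manifest's own file name.
    algorithm = re.fullmatch(
        r"(?:tag)?manifest-([a-z0-9]+)\.txt", manifest_name
    ).group(1)
    # Assumption: UTF-8; a fuller tool would honor bagit.txt instead.
    text = (base / manifest_name).read_text(encoding="utf-8")

    failures = []
    for expected, filename in parse_manifest(text):
        path = base / filename
        if not path.is_file():
            failures.append(filename)
            continue
        # Reads the whole file into memory, which is fine for a sketch.
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        if digest != expected:
            failures.append(filename)
    return failures
```

   Under these assumptions, a complete bag would be valid when
   failed_checksums() returns an empty list for every payload manifest
   and tag manifest it contains.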

   If a bag is neither valid, complete, nor incomplete, it is _invalid_.
   Definitions for the various ways a bag may be invalid are not covered
   by this specification.

   Tag files that do not appear in a tag manifest can be modified, added
   to, or removed from a bag without impacting the completeness or
   validity of the bag.

4.  Serialization

   In some scenarios, it may be convenient to serialize the bag's
   filesystem hierarchy (i.e., the base directory) into a single-file
   archive format such as TAR or ZIP (the serialization) and then later
   deserialize the serialization to recreate the filesystem hierarchy.
   Several rules govern the serialization of a bag and apply equally to
   all types of archive files:

   1.  The top-level directory of a serialization MUST contain only one
       bag.

   2.  The serialization SHOULD have the same name as the bag's base
       directory, but MUST have an extension added to identify the
       format.  For example, the receiver of "mybag.tar.gz" expects the
       corresponding base directory to be created as "mybag".

   3.  A bag MUST NOT be serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" and,
       beneath "mybag", the creation of all payload and tag files.

   4.  The deserialization of a bag MUST produce a single base directory
       bag with the top-level structure as described in this
       specification without requiring any additional un-archiving step.
       For example, after one un-archiving step it would be an error for
       the "data/" directory to appear as "data.tar.gz".
       TAR and ZIP
       files may appear inside the payload beneath the "data/"
       directory, where they would be treated as any other payload file.

   When serializing a bag, care must be taken to ensure that the archive
   format's restrictions on file naming, such as allowable characters,
   length, or character encoding, will support the requirements of the
   systems on which it will be used.  See Section 7.2.

5.  Examples

5.1.  Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file.  Lines of file content are shown in parentheses beneath the
   file name.

   myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-Version: 0.97)
   |    (Tag-File-Character-Encoding: UTF-8)
   |
   \--- data/
         |
         |   27613-h/images/q172.png
         |    (... image bytes ...)
         |
         |   27613-h/images/q172.txt
         |    (... OCR text ...)
         ....

5.2.  Another example bag

   The following example bag contains content from a web crawler.  As
   before, lines of file content are shown in parentheses beneath the
   file name, with long lines continued indented on subsequent lines.
   This bag is not complete until every component listed in the
   "fetch.txt" file is retrieved.

   mysecondbag/
   |
   |   manifest-md5.txt
   |    (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt)
   |    (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt)
   |    (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
   |    (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
   |
   |   fetch.txt
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz
   |       26583985 data/gov-20060601-050019.arc.gz)
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz
   |       99509720 data/gov-20060601-100002.arc.gz)
   |    (...)
   |
   |   bag-info.txt
   |    (Source-organization: California Digital Library)
   |    (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612)
   |    (Contact-name: A. E. Newman)
   |    (Contact-phone: +1 510-555-1234)
   |    (Contact-email: alfred@ucop.edu)
   |    (External-Description: The collection "Local Davis Flood Control
   |       Collection" includes captured California State and local
   |       websites containing information on flood control resources for
   |       the Davis and Sacramento area.  Sites were captured by UC Davis
   |       curator Wrigley Spyder using the Web Archiving Service in
   |       February 2007 and October 2007.)
   |    (Bagging-Date: 2008-04-15)
   |    (External-identifier: ark:/13030/fk4jm2bcp)
   |    (Bag-size: about 22 GB)
   |    (Payload-Oxum: 21836794142.831)
   |    (Internal-sender-identifier: UCDL)
   |    (Internal-sender-description: UC Davis Libraries)
   |
   |   bagit.txt
   |    (BagIt-Version: 0.97)
   |    (Tag-File-Character-Encoding: UTF-8)
   |
   \--- data/
         |
         |   Collection Overview.txt
         |    (... narrative description ...)
         |
         |   Seed List.txt
         |    (... list of crawler starting point URLs ...)
         ....

6.  Security Considerations

6.1.  Special directory characters

   The paths specified in the payload manifest, tag manifest, and
   "fetch.txt" file do not prohibit special directory characters which
   might be significant on implementing systems.  Implementors SHOULD
   take care that files outside the bag directory structure are not
   accessed when reading or writing files based on paths specified in a
   bag.

   For example, path characters such as ".." or "~" in a maliciously
   crafted "fetch.txt" file might cause a naive implementation to
   overwrite critical system files.

6.2.  Control of URLs in fetch.txt

   Implementors of tools that complete bags by retrieving URLs listed in
   a "fetch.txt" file need to be aware that some of those URLs may point
   to hosts, intentionally or unintentionally, that are not under
   control of the bag's sender.  Checksums are intended as a reasonable
   guarantee against corruption during transit, not a strong
   cryptographic protection against intentional spoofing.

6.3.  File sizes in fetch.txt

   The size of files, as optionally reported in the "fetch.txt" file,
   cannot be guaranteed to match the actual file size to be downloaded.
   Implementors SHOULD take care to appropriately handle cases where the
   actual file size does not match the file size reported in the
   "fetch.txt" file.  Implementors SHOULD NOT use the file size in the
   "fetch.txt" file for critical resource allocation, such as buffer
   sizing or storage requisitioning.

7.  Practical Considerations (non-normative)

7.1.  Disk and network transfer

   When creating a bag on physical media (such as hard disk, CD-ROM, or
   DVD) for transfer to another organization, the sender should select
   and format the media in a manner compatible with both the content
   requirements (e.g., file names and sizes) and the receiver's
   technical infrastructure.
   If the receiver's infrastructure is not
   known or the media needs to be compatible with a range of potential
   receivers, consideration should be given to portability and common
   usage.  For example, a "lowest common denominator" for some potential
   receivers could be USB disk drives formatted with the FAT32
   filesystem.

   Although overall bag size is unlimited in principle, network-based
   transfers may involve constraints on the amount of bag data that a
   receiver can receive at one time.  It may be practical to split a
   large bag into several smaller bags.

   Transmitting a whole bag in serialized form as a single file will
   tend to be the most straightforward mode of transfer.  When
   throughput is a priority, use of "fetch.txt" lends itself to an easy,
   application-level parallelism in which the list of URL-addressed
   items to fetch is divided among multiple processes.  The mechanics of
   sending and receiving bags over networks are otherwise out of scope
   of the present document and may be facilitated by protocols such as
   [GRABIT] and [SWORD].

7.2.  Interoperability

   This section is not part of the BagIt specification.  It describes
   some practical considerations for bag creators and receivers circa
   2010.

7.2.1.  Checksum tools

   Some cautions regarding bag interchange arise in regard to the
   commonly available checksum tools distributed with the GNU Coreutils
   package (md5sum, sha1sum, etc.), collectively referred to here as
   "md5sum".  First, md5sum can be run in binary or text mode; text mode
   sometimes normalizes line-endings.  While these modes appear to
   produce the same checksums under Unix-like systems, they can produce
   different checksums under Windows.
   When using md5sum, it may be
   safest to run it in binary mode, with one caveat: a side-effect of
   binary mode is that md5sum requires a space and an asterisk ('*'),
   compared to two spaces in text mode, between the CHECKSUM and
   FILENAME in its manifest format.

   Due to the widespread use of md5sum (and its relatives), it is not
   unexpected for bag receivers to see manifests in which CHECKSUM and
   FILENAME are separated by a space followed by an asterisk.
   Implementors creating or processing bags with md5sum should be aware
   of these subtle differences, and ensure compliance with the manifest
   specification in this document. Implementors creating and processing
   bags with other tools may wish to be tolerant of asterisks found in
   the manifests.

   A final note about md5sum-generated manifests is that for a FILENAME
   containing a backslash ('\'), the manifest line will have a backslash
   inserted in front of the CHECKSUM and, under Windows, the backslashes
   inside FILENAME may be doubled.

7.2.2. Windows and Unix file naming

   As specified above, only the Unix-based path separator ('/') may be
   used inside filenames listed in BagIt manifests and "fetch.txt"
   files. When bags are exchanged between Windows and Unix platforms,
   care should be taken to translate the path separator as needed.
   Receivers of bags on physical media should be prepared for
   filesystems created under either Windows or Unix. Besides the
   fundamental difference between path separators ('\' and '/'),
   generally, Windows filesystems have more limitations than Unix
   filesystems. Windows path names have a maximum of 255 characters,
   and none of these characters may be used in a path component:

      < > : " / | ? *

   Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
   COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
   LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
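
   As an illustration only, a bag creator might screen manifest paths
   against the Windows limits listed in this section before sending a
   bag to a receiver on an unknown platform. The sketch below is not
   part of BagIt: the function name windows_portability_problems is
   hypothetical, and it assumes that reserved names are matched without
   regard to case and with any file extension ignored, which goes
   slightly beyond the wording above.

```python
# Characters this section lists as not allowed in a Windows path component.
# '/' is in that list too, but it never appears inside a component here
# because the manifest path is split on it.
FORBIDDEN_CHARS = set('<>:"/|?*')

# Device names this section lists as reserved by Windows.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)

MAX_PATH_CHARS = 255  # path length limit quoted in this section


def windows_portability_problems(manifest_path):
    """Return a list of problem descriptions; an empty list means none found.

    manifest_path is a path as written in a BagIt manifest, i.e. using
    '/' as the separator.
    """
    problems = []
    if len(manifest_path) > MAX_PATH_CHARS:
        problems.append(f"path longer than {MAX_PATH_CHARS} characters")
    for component in manifest_path.split("/"):
        bad = sorted(FORBIDDEN_CHARS.intersection(component))
        if bad:
            problems.append(f"{component!r} contains {' '.join(bad)}")
        # Assumption: compare case-insensitively and ignore any extension,
        # so "nul.txt" is flagged as well as "NUL".
        stem = component.split(".", 1)[0].upper()
        if stem in RESERVED_NAMES:
            problems.append(f"{component!r} uses reserved name {stem}")
    return problems
```

   For instance, windows_portability_problems("data/aux.txt") reports
   one problem (the reserved name AUX), while an ordinary path such as
   "data/images/scan-001.tif" yields an empty list. Returning
   descriptions rather than raising an error lets a tool list every
   problem in a bag at once.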
   See [MSFNAM] for more
   information.

8. Acknowledgements

   BagIt owes much to many thoughtful contributors and reviewers,
   including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards,
   Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David
   Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim
   Tuttle.

9. IANA Considerations

   This draft does not request any action from IANA.

10. References

10.1. Normative References

   [MSFNAM]   Microsoft, "Naming a File", 2008.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              DOI 10.17487/RFC1321, April 1992.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3174]  Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629,
              November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

10.2. Informative References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [GRABIT]   NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008.

   [SWORD]    UKOLN/JISC CETIS, "Simple Web-service Offering Repository
              Deposit (SWORD)", 2008.

Authors' Addresses

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   US

   Email: jak@ucop.edu

   Justin Littman
   George Washington University Libraries
   2130 H Street, NW
   Washington, DC 20052
   USA

   Email: justinlittman@gmail.com

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: emad@loc.gov

   Ed Summers
   University of Maryland
   0301 Hornbake Library
   College Park, MD 20742-7011
   USA

   Email: ehs@pobox.com

   Andy Boyko
   1538 Winding Way
   Belmont, CA 94002
   USA

   Email: andrew@boyko.net

   Brian Vargas
   1354 Quincy St. NW
   Washington, DC 20011
   USA

   Email: brian@ardvaark.net