Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Expires: December 25, 2015                                    J. Littman
                                            George Washington University
                                                               Libraries
                                                               L. Madden
                                                     Library of Congress
                                                              E. Summers
                                                  University of Maryland
                                                                A. Boyko
                                                               B. Vargas
                                                           June 23, 2015


                The BagIt File Packaging Format (V0.97)
                        draft-kunze-bagit-11.txt

Abstract

   This document specifies BagIt, a hierarchical file packaging format
   for storage and transfer of arbitrary digital content. A "bag" has
   just enough structure to enclose descriptive "tags" and a "payload"
   but does not require knowledge of the payload's internal semantics.
   This BagIt format should be suitable for disk-based or network-based
   storage and transfer.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 25, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Purpose
     1.2. Requirements
     1.3. Terminology
   2. Structure
     2.1. Required Elements
       2.1.1. Bag Declaration: bagit.txt
       2.1.2. Payload Directory: data/
       2.1.3. Payload Manifest: manifest-<algorithm>.txt
     2.2. Optional Elements
       2.2.1. Tag Manifest: tagmanifest-<algorithm>.txt
       2.2.2. Bag Metadata: bag-info.txt
       2.2.3. Fetch File: fetch.txt
       2.2.4. Other Tag Files
     2.3. Text Tag File Format
     2.4. Bag Checksum Algorithms
   3. Complete, Incomplete, and Valid bags
   4. Serialization
   5. Examples
     5.1. Example of a basic bag
     5.2. Another example bag
   6. Security Considerations
     6.1. Special directory characters
     6.2. Control of URLs in fetch.txt
     6.3. File sizes in fetch.txt
   7. Practical Considerations (non-normative)
     7.1. Disk and network transfer
     7.2. Interoperability
       7.2.1. Checksum tools
       7.2.2. Windows and Unix file naming
   8. Acknowledgements
     8.1. IANA Considerations
   9. References
   Authors' Addresses

1. Introduction

1.1. Purpose

   BagIt is a hierarchical file packaging format designed to support
   disk-based or network-based storage and transfer of arbitrary digital
   content. A bag consists of a "payload" and "tags". The content of
   the payload is the custodial focus of the bag and is treated as
   semantically opaque. The "tags" are metadata files intended to
   facilitate and document the storage and transfer of the bag. The
   name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".

   Implementors of BagIt tools should consider interoperability between
   different platforms, operating systems, toolsets, and languages.
   Differences in path separators, newline characters, reserved file
   names, and maximum path lengths are all possible barriers to moving
   bags between different systems.
   Discussion of these issues may be found in the Interoperability
   section of this document (Section 7.2).

1.2. Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   An implementation is not compliant if it fails to satisfy one or more
   of the MUST or REQUIRED level requirements for the protocols it
   implements. An implementation that satisfies all the MUST or
   REQUIRED level and all the SHOULD level requirements for its
   protocols is said to be "unconditionally compliant"; one that
   satisfies all the MUST level requirements but not all the SHOULD
   level requirements for its protocols is said to be "conditionally
   compliant."

1.3. Terminology

   This specification uses a number of terms to describe BagIt, some of
   which are in common use, some of which are newly defined by this
   specification, and others which may have meanings obvious only to
   those in the community from which this spec arose. Terms defined in
   this section are intended to clarify any ambiguity.

   bag  A set of opaque data contained within the structure defined by
      this specification.

   bag declaration  The tag file required to be in all bags conforming
      to this specification. Contains tags necessary for bootstrapping
      the reading and processing of the rest of a bag. See
      Section 2.1.1.

   bag checksum algorithm  A reference to a cryptographic checksum
      algorithm, such as MD5 or SHA-1, with its name normalized for use
      in a manifest or tag manifest file name. See Section 2.4.

   complete  A bag which comprises all elements required by this
      specification, with all files listed in all payload and tag
      manifests present, and all payload files listed in at least one
      payload manifest. See Section 3.

   payload  The data encapsulated by the bag.
      The contents of the payload are opaque to this specification, and
      are always considered as a set of octet streams. See
      Section 2.1.2.

   serialized bag  A bag that has been serialized into a single,
      monolithic file. See Section 4.

   tag directory  A directory that contains one or more tag files.

   tag file  A file that contains metadata intended to facilitate and
      document the storage and transfer of the bag.

   valid  A complete bag wherein every checksum in every payload
      manifest and tag manifest can be successfully verified against the
      corresponding payload file or tag file. See Section 3.

2. Structure

   A bag consists of a base directory containing (1) a set of required
   and optional tag files; (2) a sub-directory named "data", called the
   payload directory; and (3) a set of optional tag directories. The
   payload files in the payload directory are an arbitrary file
   hierarchy (see Section 2.1.2). The tag files in the base directory
   consist of one or more files named "manifest-_algorithm_.txt" (see
   Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
   zero or more additional tag files (see Section 2.2). The tag files
   in the optional tag directories are arbitrary file hierarchies and
   the tag directories MAY have any name that is not reserved for a file
   or directory in this specification.

   The base directory MAY have any name.

   <base directory>/
   |   bagit.txt
   |   manifest-<algorithm>.txt
   |   [optional additional tag files]
   \--- data/
   |      [payload files]
   \--- [optional tag directories]/
          |   [optional tag files]

2.1. Required Elements

2.1.1. Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines:

      BagIt-Version: M.N
      Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag
   files.
   The bag declaration MUST be encoded in UTF-8 [RFC3629], and MUST NOT
   contain a byte-order mark (BOM).

   The appropriate version for a bag that conforms to this version of
   the specification is "0.97".

2.1.2. Payload Directory: data/

   The base directory MUST contain a sub-directory named "data", called
   the payload directory.

   The payload directory contains the custodial content within the bag.

   The files under the payload directory are called payload files, or
   the payload. The payload is treated as octet streams for all
   purposes relating to this specification, and is not otherwise
   prescribed.

2.1.3. Payload Manifest: manifest-<algorithm>.txt

   A payload manifest is a tag file that lists payload files and
   checksums for those payload files generated using a particular bag
   checksum algorithm. Every bag MUST contain at least one payload
   manifest file, and MAY contain more than one. A payload manifest
   file MUST have a name of the form manifest-_algorithm_.txt, where
   _algorithm_ is a string specifying the bag checksum algorithm used in
   that manifest, such as:

      manifest-md5.txt
      manifest-sha1.txt

   A bag MUST NOT contain more than one payload manifest for a
   particular bag checksum algorithm.

   Each line of a payload manifest file MUST be of the form:

      CHECKSUM FILENAME

   where FILENAME is the pathname of a file relative to the base
   directory and CHECKSUM is a hex-encoded checksum calculated according
   to _algorithm_ over every octet in the file. The hex-encoded
   checksum MAY use uppercase and/or lowercase letters. The slash
   character ('/') MUST be used as a path separator in FILENAME. One or
   more linear whitespace characters (spaces or tabs) MUST separate
   CHECKSUM from FILENAME. An asterisk ('*') MAY precede FILENAME for
   interoperability on some platforms (see Section 7.2.1). There is no
   limitation on the length of a pathname.
   The payload manifest MUST NOT reference files outside the payload
   directory.

   Payload manifests only include the pathnames of files. Because of
   this, a payload manifest cannot reference empty directories. To
   account for an empty directory, a bag creator may wish to include at
   least one file in that directory; it suffices, for example, to
   include a zero-length file named ".keep".

2.2. Optional Elements

2.2.1. Tag Manifest: tagmanifest-<algorithm>.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm. A bag MAY contain one or more tag manifests. A tag
   manifest file MUST have a name of the form
   "tagmanifest-_algorithm_.txt", where _algorithm_ is a string
   specifying the bag checksum algorithm used in that manifest, such as:

      tagmanifest-md5.txt
      tagmanifest-sha1.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.
   As a result, no FILENAME listed in a tag manifest begins with
   "data/".

2.2.2. Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload. The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human
   readability. All metadata elements are optional and MAY be repeated.
   Implementations SHOULD assume that the ordering is significant and
   provide access to the metadata elements in the order they are given
   in the "bag-info.txt" file.

   A metadata element MUST consist of a label, a colon, and a value,
   each separated by optional whitespace. It is RECOMMENDED that lines
   not exceed 79 characters in length.
   Long values may be continued onto the next line by inserting a
   newline (LF), a carriage return (CR), or carriage return plus newline
   (CRLF) and indenting the next line with linear white space (spaces or
   tabs).

   Reserved metadata element names are case-insensitive and defined as
   follows.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      delivery.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.

   Payload-Oxum  The "octetstream sum" of the payload, namely, a two-
      part number of the form "OctetCount.StreamCount", where OctetCount
      is the total number of octets (8-bit bytes) across all payload
      file content and StreamCount is the total number of payload files.
      Payload-Oxum should be included in "bag-info.txt" if at all
      possible. Compared to Bag-Size (above), Payload-Oxum is intended
      for machine consumption.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs. This identifier must
      be unique across the sender's content, and if recognizable as
      belonging to a globally unique scheme, the receiver should make an
      effort to honor reference to it.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local prose description of the
      contents of the bag.

   In addition to these metadata elements, other arbitrary metadata
   elements may also be present.

   Here is an example "bag-info.txt" file.

   Source-Organization: Spengler University
   Organization-Address: 1400 Elm St., Cupertino, California, 95014
   Contact-Name: Edna Janssen
   Contact-Phone: +1 408-555-1212
   Contact-Email: ej@spengler.edu
   External-Description: Uncompressed greyscale TIFF images from the
        Yoshimuri papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: spengler_yoshimuri_001
   Bag-Size: 260 GB
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: spengler_yoshimuri
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/yoshimuri
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3. Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness. An OPTIONAL tag file named "fetch.txt"
   contains such a list. Each line of "fetch.txt" has the form

      URL LENGTH FILENAME

   where URL identifies the file to be fetched, LENGTH is the number of
   octets in the file (or "-", to leave it unspecified), and FILENAME
   identifies the corresponding payload file, relative to the base
   directory. The slash character ('/') MUST be used as a path
   separator in FILENAME.
   If FILENAME begins with a slash character, the destination MUST still
   be treated as relative to the bag base directory. One or more linear
   whitespace characters (spaces or tabs) MUST separate these three
   values, and any such characters in the URL MUST be percent-encoded
   [RFC3986]. There is no limitation on the length of any of the fields
   in the "fetch.txt".

   The "fetch.txt" file allows a bag to be transmitted with "holes" in
   it, which can be practical for several reasons. For example, it
   obviates the need for the sender to stage a large serialized copy of
   the content while the bag is transferred to the receiver. Also, this
   method allows a sender to construct a bag from components that are
   either a subset of logically related components (e.g., the localized
   logical object could be much larger than what is intended for export)
   or assembled from logically distributed sources (e.g., the object
   components for export are not stored locally under one filesystem
   tree).

2.2.4. Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   specification. Implementations SHOULD ignore the content of any
   unexpected tag files, except when they are listed in a tag manifest.
   When unexpected tag files are listed in a tag manifest,
   implementations MUST only treat the content of those tag files as
   octet streams for the purpose of checksum verification.

2.3. Text Tag File Format

   All tag files specifically described in this specification MUST
   adhere to the text tag file format described below. Other tag files
   MAY adhere to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a newline (LF), a carriage return (CR), or carriage return plus
   newline (CRLF). Text tag files MUST end in the extension ".txt".
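
   As a non-normative illustration of the line-termination rule above,
   the following Python sketch splits the octets of a text tag file on
   exactly the three permitted terminators. The function name
   split_tag_lines and its encoding parameter are assumptions made for
   this example and are not defined by this specification; the encoding
   is presumed to have been read from the Tag-File-Character-Encoding
   value in "bagit.txt". A regular expression is used rather than
   str.splitlines(), which also breaks on characters such as form feed
   that this format does not treat as line terminators.

```python
import re

# Line terminators permitted by the text tag file format: CRLF, CR, or LF.
# CRLF is listed first so that it is consumed as one terminator, not two.
_TERMINATOR = re.compile(r"\r\n|\r|\n")


def split_tag_lines(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode a text tag file and return its lines without terminators.

    `encoding` is a hypothetical parameter standing in for the value of
    Tag-File-Character-Encoding; this sketch does not read bagit.txt.
    """
    text = raw.decode(encoding)
    lines = _TERMINATOR.split(text)
    # A file whose last line is terminated leaves one empty trailing
    # element after the split; drop it so it is not seen as a line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# Mixed terminators, as might occur in a bag moved between platforms.
sample = b"BagIt-Version: 0.97\r\nTag-File-Character-Encoding: UTF-8\n"
print(split_tag_lines(sample))
```

   A tool that follows this approach would accept tag files written on
   any of the common platforms without first rewriting their line
   endings, which would otherwise change the tag file checksums.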

   In all text tag files except for the bag declaration file, text MUST
   be encoded in the character encoding specified in the "bagit.txt" bag
   declaration file. Text tag files except for the bag declaration file
   MAY include a byte-order mark (BOM) only if the specified encoding
   requires it for proper decoding. (Note that UTF-8 does not.)

   As specified in Section 2.1.1, the bag declaration file must be
   encoded in UTF-8 and must not include a byte-order mark.

2.4. Bag Checksum Algorithms

   The payload manifest and tag manifests assert integrity of the
   payload and tags in a bag using checksum algorithms. The operation
   of those algorithms, and the formatting of their output within a
   manifest file, are generally beyond the scope of this specification,
   except that the output format MUST be able to fit in the manifest
   format specified in Section 2.1.3.

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.

   Implementors of tools that create and validate bags SHOULD support at
   least two widely implemented checksum algorithms: "md5" [RFC1321] and
   "sha1" [RFC3174].

3. Complete, Incomplete, and Valid bags

   A _complete_ bag MUST have the following attributes:

   1. Every required element MUST be present (Section 2.1).

   2. Every file in every payload manifest MUST be present.

   3. Every file in every tag manifest MUST be present. Tag files not
      listed in a tag manifest MAY be present.

   4. Every payload file MUST be listed in at least one payload
      manifest. Payload files MAY be listed in more than one payload
      manifest.

   5. Every element present MUST comply with this specification.

   A bag is _incomplete_ when it exhibits any of the following
   exceptions to the attributes of a complete bag:

   1. One or more files in any payload manifest are absent.

   2. One or more files in any tag manifest are absent.

   3. A fetch.txt is present. Any files listed in any payload manifest
      or any tag manifest which are absent MUST be listed in the
      fetch.txt.

   A _valid_ bag must have the following attributes:

   1. The bag MUST be complete.

   2. Every CHECKSUM in every payload manifest and tag manifest can be
      successfully verified against the contents of its corresponding
      FILENAME.

   If a bag is neither valid, complete, nor incomplete, it is _invalid_.
   Definitions for the various ways a bag may be invalid are not covered
   by this specification.

   Tag files that do not appear in a tag manifest can be modified, added
   to, or removed from a bag without impacting the completeness or
   validity of the bag.

4. Serialization

   In some scenarios, it may be convenient to serialize the bag's
   filesystem hierarchy (i.e., the base directory) into a single-file
   archive format such as TAR or ZIP (the serialization) and then later
   deserialize the serialization to recreate the filesystem hierarchy.
   Several rules govern the serialization of a bag and apply equally to
   all types of archive files:

   1. The top-level directory of a serialization MUST contain only one
      bag.

   2. The serialization SHOULD have the same name as the bag's base
      directory, but MUST have an extension added to identify the
      format. For example, the receiver of "mybag.tar.gz" expects the
      corresponding base directory to be created as "mybag".

   3. A bag MUST NOT be serialized from within its base directory, but
      from the parent of the base directory (where the base directory
      appears as an entry). Thus, after a bag is deserialized in an
      empty directory, a listing of that directory shows exactly one
      entry. For example, deserializing "mybag.zip" in an empty
      directory causes the creation of the base directory "mybag" and,
      beneath "mybag", the creation of all payload and tag files.

   4. The deserialization of a bag MUST produce a single base directory
      bag with the top-level structure as described in this
      specification without requiring any additional un-archiving step.
      For example, after one un-archiving step it would be an error for
      the "data/" directory to appear as "data.tar.gz". TAR and ZIP
      files may appear inside the payload beneath the "data/"
      directory, where they would be treated as any other payload file.

   When serializing a bag, care must be taken to ensure that the archive
   format's restrictions on file naming, such as allowable characters,
   length, or character encoding, will support the requirements of the
   systems on which it will be used. See Section 7.2.

5. Examples

5.1. Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file. Lines of file content are shown in parentheses beneath the
   file name.

   myfirstbag/
   |
   | manifest-md5.txt
   |  (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |  (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   | bagit.txt
   |  (BagIt-Version: 0.97                                          )
   |  (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
         |
         | 27613-h/images/q172.png
         |  (... image bytes ...                                )
         |
         | 27613-h/images/q172.txt
         |  (... OCR text ...                                   )
         ....

5.2. Another example bag

   The following example bag contains content from a web crawler. As
   before, lines of file content are shown in parentheses beneath the
   file name, with long lines continued indented on subsequent lines.
   This bag is not complete until every component listed in the
   "fetch.txt" file is retrieved.

   mysecondbag/
   |
   | manifest-md5.txt
   |  (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt   )
   |  (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt             )
   |  (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
   |  (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
   |
   | fetch.txt
   |  (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz
   |     26583985 data/gov-20060601-050019.arc.gz                      )
   |  (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz
   |     99509720 data/gov-20060601-100002.arc.gz                      )
   |  ( ...............................................................)
   |
   | bag-info.txt
   |  (Source-organization: California Digital Library                 )
   |  (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612 )
   |  (Contact-name: A. E. Newman                                      )
   |  (Contact-phone: +1 510-555-1234                                  )
   |  (Contact-email: alfred@ucop.edu                                  )
   |  (External-Description: The collection "Local Davis Flood Control
   |     Collection" includes captured California State and local
   |     websites containing information on flood control resources for
   |     the Davis and Sacramento area. Sites were captured by UC Davis
   |     curator Wrigley Spyder using the Web Archiving Service in
   |     February 2007 and October 2007.                               )
   |  (Bagging-Date: 2008-04-15                                        )
   |  (External-identifier: ark:/13030/fk4jm2bcp                       )
   |  (Bag-size: about 22Gb                                            )
   |  (Payload-Oxum: 21836794142.831                                   )
   |  (Internal-sender-identifier: UCDL                                )
   |  (Internal-sender-description: UC Davis Libraries                 )
   |
   | bagit.txt
   |  (BagIt-Version: 0.97                                             )
   |  (Tag-File-Character-Encoding: UTF-8                              )
   |
   \--- data/
         |
         | Collection Overview.txt
         |  (... narrative description ...              )
         |
         | Seed List.txt
         |  (... list of crawler starting point URLs ...)
         ....

6. Security Considerations

6.1. Special directory characters

   The paths specified in the payload manifest, tag manifest, and
   "fetch.txt" file do not prohibit special directory characters which
   might be significant on implementing systems. Implementors SHOULD
   take care that files outside the bag directory structure are not
   accessed when reading or writing files based on paths specified in a
   bag.

   For example, path characters such as ".." or "~" in a maliciously
   crafted "fetch.txt" file might cause a naive implementation to
   overwrite critical system files.

6.2. Control of URLs in fetch.txt

   Implementors of tools that complete bags by retrieving URLs listed in
   a "fetch.txt" file need to be aware that some of those URLs may point
   to hosts, intentionally or unintentionally, that are not under
   control of the bag's sender. Checksums are intended as a reasonable
   guarantee against corruption during transit, not a strong
   cryptographic protection against intentional spoofing.

6.3. File sizes in fetch.txt

   The size of files, as optionally reported in the "fetch.txt" file,
   cannot be guaranteed to match the actual file size to be downloaded.
   Implementors SHOULD take care to appropriately handle cases where the
   actual file size does not match the file size reported in the
   "fetch.txt" file. Implementors SHOULD NOT use the file size in the
   "fetch.txt" file for critical resource allocation, such as buffer
   sizing or storage requisitioning.

7. Practical Considerations (non-normative)

7.1. Disk and network transfer

   When creating a bag on physical media (such as hard disk, CD-ROM, or
   DVD) for transfer to another organization, the sender should select
   and format the media in a manner compatible with both the content
   requirements (e.g., file names and sizes) and the receiver's
   technical infrastructure.
   If the receiver's infrastructure is not known or the media needs to
   be compatible with a range of potential receivers, consideration
   should be given to portability and common usage. For example, a
   "lowest common denominator" for some potential receivers could be USB
   disk drives formatted with the FAT32 filesystem.

   Although overall bag size is unlimited in principle, network-based
   transfers may involve constraints on the amount of bag data that a
   receiver can receive at one time. It may be practical to split a
   large bag into several smaller bags.

   Transmitting a whole bag in serialized form as a single file will
   tend to be the most straightforward mode of transfer. When
   throughput is a priority, use of "fetch.txt" lends itself to an easy,
   application-level parallelism in which the list of URL-addressed
   items to fetch is divided among multiple processes. The mechanics of
   sending and receiving bags over networks is otherwise out of scope of
   the present document and may be facilitated by protocols such as
   [GRABIT] and [SWORD].

7.2. Interoperability

   This section is not part of the BagIt specification. It describes
   some practical considerations for bag creators and receivers circa
   2010.

7.2.1. Checksum tools

   Some cautions regarding bag interchange arise in regard to the
   commonly available checksum tools distributed with the GNU Coreutils
   package (md5sum, sha1sum, etc.), collectively referred to here as
   "md5sum". First, md5sum can be run in binary or text mode; text mode
   sometimes normalizes line-endings. While these modes appear to
   produce the same checksums under Unix-like systems, they can produce
   different checksums under Windows.
   When using md5sum, it may be safest to run it in binary mode, with
   one caveat: a side effect of binary mode is that md5sum requires a
   space and an asterisk ('*'), compared to two spaces in text mode,
   between the CHECKSUM and FILENAME in its manifest format.

   Due to the widespread use of md5sum (and its relatives), bag
   receivers can expect to see manifests in which CHECKSUM and FILENAME
   are separated by a space followed by an asterisk. Implementors
   creating or processing bags with md5sum should be aware of these
   subtle differences and ensure compliance with the manifest
   specification in this document. Implementors creating and processing
   bags with other tools may wish to be tolerant of asterisks found in
   the manifests.

   A final note about md5sum-generated manifests is that for a FILENAME
   containing a backslash ('\'), the manifest line will have a backslash
   inserted in front of the CHECKSUM and, under Windows, the backslashes
   inside FILENAME may be doubled.

7.2.2. Windows and Unix file naming

   As specified above, only the Unix-based path separator ('/') may be
   used inside filenames listed in BagIt manifests and "fetch.txt"
   files. When bags are exchanged between Windows and Unix platforms,
   care should be taken to translate the path separator as needed.
   Receivers of bags on physical media should be prepared for
   filesystems created under either Windows or Unix. Besides the
   fundamental difference between path separators ('\' and '/'),
   Windows filesystems generally have more limitations than Unix
   filesystems. Windows path names have a maximum of 255 characters,
   and none of these characters may be used in a path component:

      < > : " / | ? *

   Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
   COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
   LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.
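   A bag creator who wants to check paths against the Windows
   restrictions listed above before shipping a bag could use something
   like the following Python sketch. It is illustrative only and not
   part of the BagIt specification. The function names
   component_problems and path_problems are hypothetical. The
   case-insensitive comparison and the stripping of file extensions
   before the reserved-name test are assumptions made for this example,
   not statements from this document. Path length is not checked.

```python
# Characters the text above lists as unusable in a Windows path component.
FORBIDDEN_CHARS = set('<>:"/|?*')

# Names the text above lists as reserved by Windows.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def component_problems(component: str) -> list[str]:
    """Return reasons why one path component may not be portable to Windows."""
    problems = []
    bad = sorted(FORBIDDEN_CHARS.intersection(component))
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    # Assumption: compare case-insensitively and ignore any extension,
    # so that "nul.txt" is flagged as well as "NUL".
    stem = component.split(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        problems.append("reserved name: " + stem)
    return problems


def path_problems(bag_path: str) -> dict[str, list[str]]:
    """Check each '/'-separated component of a manifest-style path."""
    report = {}
    for component in bag_path.split("/"):
        found = component_problems(component)
        if found:
            report[component] = found
    return report
```

   For instance, path_problems("data/reports/aux.txt") reports the
   component "aux.txt" as a reserved name, while
   path_problems("data/ok.txt") returns an empty dictionary.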
   See [MSFNAM] for more information.

8. Acknowledgements

   BagIt owes much to many thoughtful contributors and reviewers,
   including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards,
   Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David
   Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim
   Tuttle.

8.1. IANA Considerations

   This draft does not request any action from IANA.

9. References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [GRABIT]   NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008.

   [MSFNAM]   Microsoft, "Naming a File", 2008.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              April 1992.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3174]  Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [SWORD]    UKOLN/JISC CETIS, "Simple Web-service Offering Repository
              Deposit (SWORD)", 2008.

Authors' Addresses

   John A.
   Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   US

   Email: jak@ucop.edu

   Justin Littman
   George Washington University Libraries
   2130 H Street, NW
   Washington, DC 20052
   USA

   Email: justinlittman@gmail.com

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: emad@loc.gov

   Ed Summers
   University of Maryland
   0301 Hornbake Library
   College Park, MD 20742-7011
   USA

   Email: ehs@pobox.com

   Andy Boyko
   1538 Winding Way
   Belmont, CA 94002
   USA

   Email: andrew@boyko.net

   Brian Vargas
   1354 Quincy St. NW
   Washington, DC 20011
   USA

   Email: brian@ardvaark.net