Network Working Group                                           A. Boyko
Internet-Draft
Expires: October 17, 2011                                       J. Kunze
                                              California Digital Library
                                                              J. Littman
                                                               L. Madden
                                                     Library of Congress
                                                               B. Vargas
                                                          April 15, 2011


                The BagIt File Packaging Format (V0.97)
      http://www.ietf.org/internet-drafts/draft-kunze-bagit-06.txt

Abstract

   This document specifies BagIt, a hierarchical file packaging format
   for storage and transfer of arbitrary digital content.  A "bag" has
   just enough structure to enclose descriptive "tags" and a "payload"
   but does not require knowledge of the payload's internal semantics.
   This BagIt format should be suitable for disk-based or network-based
   storage and transfer.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 17, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Purpose
     1.2.  Requirements
     1.3.  Terminology
   2.  Structure
     2.1.  Required Elements
       2.1.1.  Bag Declaration: bagit.txt
       2.1.2.  Payload Directory: data/
       2.1.3.  Payload Manifest: manifest-<algorithm>.txt
     2.2.  Optional Elements
       2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt
       2.2.2.  Bag Metadata: bag-info.txt
       2.2.3.  Fetch File: fetch.txt
       2.2.4.  Other Tag Files
     2.3.  Text Tag File Format
     2.4.  Bag Checksum Algorithms
   3.  Complete, Incomplete, and Valid bags
   4.  Serialization
   5.  Examples
     5.1.  Example of a basic bag
     5.2.  Another example bag
   6.  Security Considerations
     6.1.  Special directory characters
     6.2.  Control of URLs in fetch.txt
     6.3.  File sizes in fetch.txt
   7.  Practical Considerations (non-normative)
     7.1.  Disk and network transfer
     7.2.  Interoperability
       7.2.1.  Checksum tools
       7.2.2.  Windows and Unix file naming
   8.  Acknowledgements
   9.  References
   Appendix A.  Change history
     A.1.  Changes from draft-05, 2011.04.15
     A.2.  Changes from draft-04, 2009.12.20
     A.3.  Changes from draft-03, 2009.04.11
     A.4.  Changes from draft-02, 2008.07.11
     A.5.  Changes from draft-01, 2008.05.30
     A.6.  Changes from draft-00, 2008.03.24
   Authors' Addresses

1.  Introduction

1.1.  Purpose

   BagIt is a hierarchical file packaging format designed to support
   disk-based or network-based storage and transfer of arbitrary digital
   content.  A bag consists of a "payload" and "tags".  The content of
   the payload is the custodial focus of the bag and is treated as
   semantically opaque.  The "tags" are metadata files intended to
   facilitate and document the storage and transfer of the bag.  The
   name, BagIt, is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".

   Implementors of BagIt tools should consider interoperability between
   different platforms, operating systems, toolsets, and languages.
   Differences in path separators, newline characters, reserved file
   names, and maximum path lengths are all possible barriers to moving
   bags between different systems.  Discussion of these issues may be
   found in the Interoperability section of this document.

1.2.  Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   An implementation is not compliant if it fails to satisfy one or more
   of the MUST or REQUIRED level requirements for the protocols it
   implements.  An implementation that satisfies all the MUST or
   REQUIRED level and all the SHOULD level requirements for its
   protocols is said to be "unconditionally compliant"; one that
   satisfies all the MUST level requirements but not all the SHOULD
   level requirements for its protocols is said to be "conditionally
   compliant."

1.3.  Terminology

   This specification uses a number of terms to describe BagIt, some of
   which are in common use, some of which are newly defined by this
   specification, and others which may have meanings obvious only to
   those in the community from which this spec arose.  Terms defined in
   this section are intended to clarify any ambiguity.

   bag  A set of opaque data contained within the structure defined by
      this specification.

   bag declaration  The tag file required to be in all bags conforming
      to this specification.  Contains tags necessary for bootstrapping
      the reading and processing of the rest of a bag.  See
      Section 2.1.1.

   bag checksum algorithm  A reference to a cryptographic checksum
      algorithm, such as MD5 or SHA-1, with its name normalized for use
      in a manifest or tag manifest file name.  See Section 2.4.

   complete  A bag which comprises all elements required by this
      specification, with all files listed in all payload and tag
      manifests present, and with all payload files listed in at least
      one manifest.  See Section 3.

   payload  The data encapsulated by the bag.  The contents of the
      payload are opaque to this specification, and are always
      considered as a set of octet streams.  See Section 2.1.2.

   serialized bag  A bag that has been serialized into a single,
      monolithic file.  See Section 4.

   tag directory  A directory that contains one or more tag files.

   tag file  A file that contains metadata intended to facilitate and
      document the storage and transfer of the bag.

   valid  A complete bag wherein every checksum in every payload
      manifest and tag manifest can be successfully verified against the
      corresponding file.  See Section 3.

2.  Structure

   A bag consists of a base directory containing (1) a set of required
   and optional tag files; (2) a sub-directory named "data", called the
   payload directory; and (3) a set of optional tag directories.  The
   payload files in the payload directory are an arbitrary file
   hierarchy (see Section 2.1.2).  The tag files in the base directory
   consist of one or more files named "manifest-_algorithm_.txt" (see
   Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
   zero or more additional tag files (see Section 2.2).  The tag files
   in the optional tag directories are arbitrary file hierarchies and
   the tag directories MAY have any name that is not reserved for a file
   or directory in this specification.

   The base directory MAY have any name.

   <base directory>/
   |   bagit.txt
   |   manifest-<algorithm>.txt
   |   [optional additional tag files]
   \--- data/
         |   [payload files]
   \--- [optional tag directories]/
         |   [optional tag files]

2.1.  Required Elements

2.1.1.  Bag Declaration: bagit.txt

   The "bagit.txt" tag file MUST consist of exactly two lines:

   BagIt-Version: M.N
   Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag
   files.  The bag declaration MUST be encoded in UTF-8, and MUST NOT
   contain a byte-order mark (BOM).  [RFC3629]

   The appropriate version for a bag that conforms to this version of
   the specification is "0.97".

2.1.2.  Payload Directory: data/

   The base directory MUST contain a sub-directory named "data", called
   the payload directory.

   The payload directory contains the custodial content within the bag.

   The files under the payload directory are called payload files, or
   the payload.  The payload is treated as octet streams for all
   purposes relating to this specification, and is not otherwise
   prescribed.

2.1.3.  Payload Manifest: manifest-<algorithm>.txt

   A payload manifest is a tag file that lists payload files and
   checksums for those payload files generated using a particular bag
   checksum algorithm.  Every bag MUST contain one payload manifest
   file, and MAY contain more than one.  A payload manifest file MUST
   have a name of the form manifest-_algorithm_.txt, where _algorithm_
   is a string specifying the bag checksum algorithm used in that
   manifest, such as:

   manifest-md5.txt
   manifest-sha1.txt

   A bag MUST NOT contain more than one payload manifest for a
   particular bag checksum algorithm.
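
   As an illustration of the naming rule above, the following Python
   sketch picks the payload manifests out of a listing of the base
   directory and reports their bag checksum algorithms.  It is not part
   of this specification: the function name
   "payload_manifest_algorithms" and the regular expression are
   assumptions made for the example, and the algorithm part of each
   name is assumed to be already normalized to lowercase letters and
   digits.

```python
import re

# Assumed pattern for "manifest-_algorithm_.txt"; the algorithm part is
# taken to be lowercase letters and digits only.
_MANIFEST_RE = re.compile(r"manifest-([a-z0-9]+)\.txt")


def payload_manifest_algorithms(file_names):
    """Return the sorted checksum algorithms of the payload manifests in file_names."""
    algorithms = []
    for name in file_names:
        match = _MANIFEST_RE.fullmatch(name)  # "tagmanifest-..." does not match
        if match:
            algorithms.append(match.group(1))
    if not algorithms:
        raise ValueError("no payload manifest found")
    if len(algorithms) != len(set(algorithms)):
        raise ValueError("more than one payload manifest for one algorithm")
    return sorted(algorithms)


names = ["bagit.txt", "manifest-md5.txt", "manifest-sha1.txt",
         "tagmanifest-md5.txt"]
print(payload_manifest_algorithms(names))
```

   A caller would typically pass the result of listing the base
   directory; tag manifests and other tag files are ignored because
   only names that match the whole pattern are considered.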

   Each line of a payload manifest file MUST be of the form:

   CHECKSUM FILENAME

   where FILENAME is the pathname of a file relative to the base
   directory and CHECKSUM is a hex-encoded checksum calculated according
   to _algorithm_ over every octet in the file.  The hex-encoded
   checksum MAY use uppercase and/or lowercase letters.  The slash
   character ('/') MUST be used as a path separator in FILENAME.  One or
   more linear whitespace characters (spaces or tabs) MUST separate
   CHECKSUM from FILENAME.  An asterisk ('*') MAY precede FILENAME for
   interoperability on some platforms (see Section 7.2.1).  There is no
   limitation on the length of a pathname.  The payload manifest MUST
   NOT reference files outside the payload directory.

   Payload manifests only include the pathnames of files.  Because of
   this, a payload manifest cannot reference empty directories.  To
   account for an empty directory, a bag creator may wish to include at
   least one file in that directory; it suffices, for example, to
   include a zero-length file named ".keep".

2.2.  Optional Elements

2.2.1.  Tag Manifest: tagmanifest-<algorithm>.txt

   A tag manifest is a tag file that lists other tag files and checksums
   for those tag files generated using a particular bag checksum
   algorithm.  A bag MAY contain one or more tag manifests.  A tag
   manifest file MUST have a name of the form
   "tagmanifest-_algorithm_.txt", where _algorithm_ is a string
   specifying the bag checksum algorithm used in that manifest, such as:

   tagmanifest-md5.txt
   tagmanifest-sha1.txt

   A tag manifest file has the same form as the payload manifest file
   described in Section 2.1.3, but MUST NOT list any payload files.  As
   a result, no FILENAME listed in a tag manifest begins "data/".

2.2.2.  Bag Metadata: bag-info.txt

   The "bag-info.txt" file is a tag file that contains metadata elements
   describing the bag and the payload.  The metadata elements contained
   in the "bag-info.txt" file are intended primarily for human
   readability.  All metadata elements are optional and MAY be repeated.
   Implementations SHOULD assume that the ordering is significant and
   provide access to the metadata elements in the order they are given
   in the "bag-info.txt" file.

   A metadata element MUST consist of a label, a colon, and a value,
   each separated by optional whitespace.  It is RECOMMENDED that lines
   not exceed 79 characters in length.  Long values may be continued
   onto the next line by inserting a newline (LF), a carriage return
   (CR), or carriage return plus newline (CRLF) and indenting the next
   line with linear white space (spaces or tabs).

   Reserved metadata element names are case-insensitive and defined as
   follows.

   Source-Organization  Organization transferring the content.

   Organization-Address  Mailing address of the organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position
      responsible.

   External-Description  A brief explanation of the contents and
      provenance.

   Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
      delivery.

   External-Identifier  A sender-supplied identifier for the bag.

   Bag-Size  Size or approximate size of the bag being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.  Compared to Payload-Oxum
      (described next), Bag-Size is intended for human consumption.

   Payload-Oxum  The "octetstream sum" of the payload, namely, a
      two-part number of the form "OctetCount.StreamCount", where
      OctetCount is the total number of octets (8-bit bytes) across all
      payload file content and StreamCount is the total number of
      payload files.  Payload-Oxum should be included in "bag-info.txt"
      if at all possible.  Compared to Bag-Size (above), Payload-Oxum is
      intended for machine consumption.

   Bag-Group-Identifier  A sender-supplied identifier for the set, if
      any, of bags to which it logically belongs.  This identifier must
      be unique across the sender's content, and if recognizable as
      belonging to a globally unique scheme, the receiver should make an
      effort to honor reference to it.

   Bag-Count  Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.

   Internal-Sender-Identifier  An alternate sender-specific identifier
      for the content and/or bag.

   Internal-Sender-Description  A sender-local prose description of the
      contents of the bag.

   In addition to these metadata elements, other arbitrary metadata
   elements may also be present.

   Here is an example "bag-info.txt" file.

   Source-Organization: Spengler University
   Organization-Address: 1400 Elm St., Cupertino, California, 95014
   Contact-Name: Edna Janssen
   Contact-Phone: +1 408-555-1212
   Contact-Email: ej@spengler.edu
   External-Description: Uncompressed greyscale TIFF images from the
        Yoshimuri papers colle...
   Bagging-Date: 2008-01-15
   External-Identifier: spengler_yoshimuri_001
   Bag-Size: 260 GB
   Payload-Oxum: 279164409832.1198
   Bag-Group-Identifier: spengler_yoshimuri
   Bag-Count: 1 of 15
   Internal-Sender-Identifier: /storage/images/yoshimuri
   Internal-Sender-Description: Uncompressed greyscale TIFFs created
        from microfilm and are...

2.2.3.  Fetch File: fetch.txt

   For reasons of efficiency, a bag MAY be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  An OPTIONAL tag file named "fetch.txt"
   contains such a list.  Each line of "fetch.txt" has the form

   URL LENGTH FILENAME

   where URL identifies the file to be fetched, LENGTH is the number of
   octets in the file (or "-", to leave it unspecified), and FILENAME
   identifies the corresponding payload file, relative to the base
   directory.  The slash character ('/') MUST be used as a path
   separator in FILENAME.  If FILENAME begins with a slash character,
   the destination MUST still be treated as relative to the bag base
   directory.  One or more linear whitespace characters (spaces or tabs)
   MUST separate these three values, and any such characters in the URL
   MUST be percent-encoded [RFC3986].  There is no limitation on the
   length of any of the fields in the "fetch.txt".

   The "fetch.txt" file allows a bag to be transmitted with "holes" in
   it, which can be practical for several reasons.  For example, it
   obviates the need for the sender to stage a large serialized copy of
   the content while the bag is transferred to the receiver.  Also, this
   method allows a sender to construct a bag from components that are
   either a subset of logically related components (e.g., the localized
   logical object could be much larger than what is intended for export)
   or assembled from logically distributed sources (e.g., the object
   components for export are not stored locally under one filesystem
   tree).

2.2.4.  Other Tag Files

   A bag MAY contain other tag files that are not defined by this
   specification.  Implementations SHOULD ignore the content of any
   unexpected tag files, except when they are listed in a tag manifest.
   When unexpected tag files are listed in a tag manifest,
   implementations MUST only treat the content of those tag files as
   octet streams for the purpose of checksum verification.

2.3.  Text Tag File Format

   All tag files specifically described in this specification MUST
   adhere to the text tag file format described below.  Other tag files
   MAY adhere to the text tag file format described below.

   Text tag files are line-oriented, and each line MUST be terminated by
   a newline (LF), a carriage return (CR), or carriage return plus
   newline (CRLF).  Text tag files MUST end in the extension ".txt".

   In all text tag files except for the bag declaration file, text MUST
   be encoded in the character encoding specified in the "bagit.txt" bag
   declaration file.  Text tag files except for the bag declaration file
   MAY include a byte-order mark (BOM) only if the specified encoding
   requires it for proper decoding.  (Note that UTF-8 does not.)

   As specified in Section 2.1.1, the bag declaration file must be
   encoded in UTF-8 and must not include a byte-order mark.

2.4.  Bag Checksum Algorithms

   The payload manifest and tag manifests assert integrity of the
   payload and tags in a bag using checksum algorithms.  The operation
   of those algorithms, and the formatting of their output within a
   manifest file, are generally beyond the scope of this specification,
   except that the output format MUST be able to fit in the manifest
   format specified in Section 2.1.3.

   The name of the checksum algorithm MUST be normalized for use in the
   manifest's filename by lowercasing the common name of the algorithm
   and removing all non-alphanumeric characters.

   Implementors of tools that create and validate bags SHOULD support at
   least two widely implemented checksum algorithms: "md5" [RFC1321] and
   "sha1" [RFC3174].

3.  Complete, Incomplete, and Valid bags

   A _complete_ bag MUST have the following attributes:

   1.  Every required element MUST be present (Section 2.1).

   2.  Every file in every payload manifest MUST be present.

   3.  Every file in every tag manifest MUST be present.  Tag files not
       listed in a tag manifest MAY be present.

   4.  Every payload file MUST be listed in at least one manifest.
       Payload files MAY be listed in more than one payload manifest.

   5.  Every element present MUST comply with this specification.

   A bag is _incomplete_ when it exhibits any of the following
   exceptions to the attributes of a complete bag:

   1.  One or more files in any payload manifest are absent.

   2.  One or more files in any tag manifest are absent.

   3.  A fetch.txt is present.  Any files listed in any payload manifest
       or any tag manifest which are absent MUST be listed in the
       fetch.txt.

   A _valid_ bag must have the following attributes:

   1.  The bag MUST be complete.

   2.  Every CHECKSUM in every payload manifest and tag manifest can be
       successfully verified against the contents of its corresponding
       FILENAME.

   If a bag is neither valid, complete, nor incomplete, it is _invalid_.
   Definitions for the various ways a bag may be invalid are not covered
   by this specification.
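
   The manifest-related parts of these definitions can be sketched in
   Python as follows.  This is an illustration under stated assumptions,
   not a reference implementation: the function name "verify_entries"
   is hypothetical, the (CHECKSUM, FILENAME) pairs are assumed to have
   been parsed from one manifest already, and the normalized algorithm
   name is assumed to be one that Python's hashlib accepts (true of
   "md5" and "sha1").  The sketch does not check that every payload
   file is listed in a manifest, and it omits the path checks discussed
   in Section 6.1.

```python
import hashlib
import os


def verify_entries(base_dir, algorithm, entries):
    """Check (checksum, filename) pairs from one manifest against the files on disk.

    Returns two lists of FILENAME values: those that are absent and those
    whose content does not match the recorded checksum.
    """
    missing, mismatched = [], []
    for checksum, filename in entries:
        # FILENAME uses '/' as separator and is relative to the base directory.
        path = os.path.join(base_dir, *filename.split("/"))
        if not os.path.isfile(path):
            missing.append(filename)
            continue
        digest = hashlib.new(algorithm)
        with open(path, "rb") as stream:  # files are treated as octet streams
            for chunk in iter(lambda: stream.read(65536), b""):
                digest.update(chunk)
        # Hex digits in a manifest may be uppercase or lowercase.
        if digest.hexdigest() != checksum.lower():
            mismatched.append(filename)
    return missing, mismatched
```

   An empty first list corresponds to attributes 2 and 3 of a complete
   bag for that manifest; two empty lists correspond to attribute 2 of
   a valid bag for that manifest.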

   Tag files that do not appear in a tag manifest can be modified, added
   to, or removed from a bag without impacting the completeness or
   validity of the bag.

4.  Serialization

   In some scenarios, it may be convenient to serialize the bag's
   filesystem hierarchy (i.e., the base directory) into a single-file
   archive format such as TAR or ZIP (the serialization) and then later
   deserialize the serialization to recreate the filesystem hierarchy.
   Several rules govern the serialization of a bag and apply equally to
   all types of archive files:

   1.  The top-level directory of a serialization MUST contain only one
       bag.

   2.  The serialization SHOULD have the same name as the bag's base
       directory, but MUST have an extension added to identify the
       format.  For example, the receiver of "mybag.tar.gz" expects the
       corresponding base directory to be created as "mybag".

   3.  A bag MUST NOT be serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" and,
       beneath "mybag", the creation of all payload and tag files.

   4.  The deserialization of a bag MUST produce a single base directory
       bag with the top-level structure as described in this
       specification without requiring any additional un-archiving step.
       For example, after one un-archiving step it would be an error for
       the "data/" directory to appear as "data.tar.gz".  TAR and ZIP
       files may appear inside the payload beneath the "data/"
       directory, where they would be treated as any other payload file.
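
   Rules 2 and 3 above can be illustrated with a short Python sketch
   that uses the standard "tarfile" module.  The function name
   "serialize_bag" and the choice of gzip-compressed TAR are
   assumptions made for the example; the destination directory is
   assumed to lie outside the bag so that the archive does not include
   itself.

```python
import os
import tarfile


def serialize_bag(base_dir, dest_dir):
    """Write <bag name>.tar.gz into dest_dir and return the archive path."""
    base_dir = os.path.abspath(base_dir)
    bag_name = os.path.basename(base_dir)
    # Same name as the base directory, plus an extension for the format.
    archive_path = os.path.join(dest_dir, bag_name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname makes the base directory the single top-level entry,
        # as if the archive had been created from its parent directory.
        archive.add(base_dir, arcname=bag_name)
    return archive_path
```

   Extracting the resulting archive in an empty directory yields exactly
   one entry, the base directory, with the tag files and "data/"
   beneath it.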

   When serializing a bag, care must be taken to ensure that the archive
   format's restrictions on file naming, such as allowable characters,
   length, or character encoding, will support the requirements of the
   systems on which it will be used.  See Section 7.2.

5.  Examples

5.1.  Example of a basic bag

   This is the layout of a basic bag containing an image and a companion
   OCR file.  Lines of file content are shown in parentheses beneath the
   file name.

   myfirstbag/
   |
   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |
   |   bagit.txt
   |    (BagIt-version: 0.96                                          )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   |
   \--- data/
         |
         |   27613-h/images/q172.png
         |    (... image bytes ...                                    )
         |
         |   27613-h/images/q172.txt
         |    (... OCR text ...                                       )
         ....

5.2.  Another example bag

   The following example bag contains content from a web crawler.  As
   before, lines of file content are shown in parentheses beneath the
   file name, with long lines continued indented on subsequent lines.
   This bag is not complete until every component listed in the
   "fetch.txt" file is retrieved.

   mysecondbag/
   |
   |   manifest-md5.txt
   |    (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt   )
   |    (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt             )
   |    (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
   |    (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
   |
   |   fetch.txt
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz
   |      26583985 data/gov-20060601-050019.arc.gz                       )
   |    (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz
   |      99509720 data/gov-20060601-100002.arc.gz                       )
   |    ( ...............................................................)
   |
   |   bag-info.txt
   |    (Source-organization: California Digital Library                 )
   |    (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612)
   |    (Contact-name: A. E. Newman                                      )
   |    (Contact-phone: +1 510-555-1234                                  )
   |    (Contact-email: alfred@ucop.edu                                  )
   |    (External-Description: The collection "Local Davis Flood Control )
   |      Collection" includes captured California State and local       )
   |      websites containing information on flood control resources for )
   |      the Davis and Sacramento area.  Sites were captured by UC Davis)
   |      curator Wrigley Spyder using the Web Archiving Service in      )
   |      February 2007 and October 2007.                                )
   |    (Bag-date: 2008.04.15                                            )
   |    (External-identifier: ark:/13030/fk4jm2bcp                       )
   |    (Bag-size: about 22Gb                                            )
   |    (Payload-Oxum: 21836794142.831                                   )
   |    (Internal-sender-identifier: UCDL                                )
   |    (Internal-sender-description: UC Davis Libraries                 )
   |
   |   bagit.txt
   |    (BagIt-version: 0.96                                             )
   |    (Tag-File-Character-Encoding: UTF-8                              )
   |
   \--- data/
         |
         |   Collection Overview.txt
         |    (... narrative description ...                             )
         |
         |   Seed List.txt
         |    (... list of crawler starting point URLs ...               )
         ....

6.  Security Considerations

6.1.  Special directory characters

   The paths specified in the payload manifest, tag manifest, and
   "fetch.txt" file do not prohibit special directory characters which
   might be significant on implementing systems.  Implementors SHOULD
   take care that files outside the bag directory structure are not
   accessed when reading or writing files based on paths specified in a
   bag.

   For example, path characters such as ".." or "~" in a maliciously
   crafted "fetch.txt" file might cause a naive implementation to
   overwrite critical system files.

6.2.  Control of URLs in fetch.txt

   Implementors of tools that complete bags by retrieving URLs listed in
   a "fetch.txt" file need to be aware that some of those URLs may point
   to hosts, intentionally or unintentionally, that are not under
   control of the bag's sender.  Checksums are intended as a reasonable
   guarantee against corruption during transit, not a strong
   cryptographic protection against intentional spoofing.

6.3.  File sizes in fetch.txt

   The size of files, as optionally reported in the "fetch.txt" file,
   cannot be guaranteed to match the actual file size to be downloaded.
   Implementors SHOULD take care to appropriately handle cases where the
   actual file size does not match the file size reported in the
   fetch.txt.  Implementors SHOULD NOT use the file size in the
   "fetch.txt" file for critical resource allocation, such as buffer
   sizing or storage requisitioning.

7.  Practical Considerations (non-normative)

7.1.  Disk and network transfer

   When creating a bag on physical media (such as hard disk, CD-ROM, or
   DVD) for transfer to another organization, the sender should select
   and format the media in a manner compatible with both the content
   requirements (e.g., file names and sizes) and the receiver's
   technical infrastructure.  If the receiver's infrastructure is not
   known or the media needs to be compatible with a range of potential
   receivers, consideration should be given to portability and common
   usage.  For example, a "lowest common denominator" for some potential
   receivers could be USB disk drives formatted with the FAT32
   filesystem.

   Although overall bag size is unlimited in principle, network-based
   transfers may involve constraints on the amount of bag data that a
   receiver can receive at one time.  It may be practical to split a
   large bag into several smaller bags.
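
   One possible way to plan such a split is sketched below in Python.
   The approach (filling each group in order until a size limit would
   be exceeded) and the names "split_payload" and "bag_count_labels"
   are assumptions made for this example and are not defined by BagIt.
   A file larger than the limit ends up in a group of its own.

```python
def split_payload(files, max_octets):
    """Group (filename, size_in_octets) pairs so each group stays within max_octets."""
    groups, current, current_size = [], [], 0
    for name, size in files:
        # Start a new group when adding this file would exceed the limit.
        if current and current_size + size > max_octets:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


def bag_count_labels(groups):
    """Return "N of T" strings suitable for the Bag-Count metadata element."""
    total = len(groups)
    return ["%d of %d" % (n, total) for n in range(1, total + 1)]
```

   Each group would then be packaged as a bag of its own, and the
   labels could be recorded in "bag-info.txt" together with a shared
   Bag-Group-Identifier.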

   Transmitting a whole bag in serialized form as a single file will
   tend to be the most straightforward mode of transfer.  When
   throughput is a priority, use of "fetch.txt" lends itself to an easy,
   application-level parallelism in which the list of URL-addressed
   items to fetch is divided among multiple processes.  The mechanics of
   sending and receiving bags over networks is otherwise out of scope of
   the present document and may be facilitated by protocols such as
   [GRABIT] and [SWORD].

7.2.  Interoperability

   This section is not part of the BagIt specification.  It describes
   some practical considerations for bag creators and receivers circa
   2010.

7.2.1.  Checksum tools

   Some cautions regarding bag interchange arise in regard to the
   commonly available checksum tools distributed with the GNU Coreutils
   package (md5sum, sha1sum, etc.), collectively referred to here as
   "md5sum".  First, md5sum can be run in binary or text mode; text mode
   sometimes normalizes line-endings.  While these modes appear to
   produce the same checksums under Unix-like systems, they can produce
   different checksums under Windows.  When using md5sum, it may be
   safest to run it in binary mode, with one caveat: a side-effect of
   binary mode is that md5sum requires a space and an asterisk ('*'),
   compared to two spaces in text mode, between the CHECKSUM and
   FILENAME in its manifest format.

   Due to the widespread use of md5sum (and its relatives), it is not
   unexpected for bag receivers to see manifests in which CHECKSUM and
   FILENAME are separated by a space followed by an asterisk.
   Implementors creating or processing bags with md5sum should be aware
   of these subtle differences, and ensure compliance with the manifest
   specification in this document.  Implementors creating and processing
   bags with other tools may wish to be tolerant of asterisks found in
   the manifests.
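
   A tolerant reader along these lines might look like the following
   Python sketch.  The function name "parse_manifest_line" and the
   regular expression are assumptions made for the example.  The sketch
   accepts either separator style, which means that a FILENAME whose
   first character really is an asterisk would be misread.

```python
import re

# CHECKSUM, then spaces or tabs, then an optional asterisk, then FILENAME.
_LINE_RE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+\*?(.+)$")


def parse_manifest_line(line):
    """Return (checksum, filename) for one manifest line, or None if it is blank."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError("not a CHECKSUM FILENAME line: %r" % line)
    # Lowercase the checksum so later comparisons ignore hex digit case.
    return match.group(1).lower(), match.group(2)


print(parse_manifest_line("E9C5753D65B1EF5AEB281C0BB880C6C8 *data/Seed List.txt\n"))
```

   Spaces inside FILENAME are preserved because only the whitespace
   between the two fields is consumed.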
   A final note about md5sum-generated manifests is that for a FILENAME
   containing a backslash ('\'), the manifest line will have a backslash
   inserted in front of the CHECKSUM and, under Windows, the backslashes
   inside FILENAME may be doubled.

7.2.2.  Windows and Unix file naming

   As specified above, only the Unix-based path separator ('/') may be
   used inside filenames listed in BagIt manifests and "fetch.txt"
   files. When bags are exchanged between Windows and Unix platforms,
   care should be taken to translate the path separator as needed.
   Receivers of bags on physical media should be prepared for
   filesystems created under either Windows or Unix. Besides the
   fundamental difference between path separators ('\' and '/'),
   generally, Windows filesystems have more limitations than Unix
   filesystems. Windows path names have a maximum of 255 characters,
   and none of these characters may be used in a path component:

      < > : " / | ? *

   Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
   COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
   LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. See [MSFNAM] for more
   information.

8.  Acknowledgements

   BagIt owes much to many thoughtful contributors and reviewers,
   including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards,
   Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David
   Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim
   Tuttle.

9.  References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [GRABIT]   NDIIPP/CDL, "The GrabIt File Exchange Protocol", 2008.

   [MSFNAM]   Microsoft, "Naming a File", 2008.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              April 1992.
   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3174]  Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [SWORD]    UKOLN/JISC CETIS, "Simple Web-service Offering Repository
              Deposit (SWORD)", 2008.

Appendix A.  Change history

   (This appendix to be removed in the final draft.)

A.1.  Changes from draft-05, 2011.04.15

   Allowing tag directories.

   Fixed definition of valid.

   Clarified that tag files do not need to be text files.

   Clarified repeatability and ordering of metadata elements in
   bag-info.txt.

   Clarified case of hex-encoding in manifests.

A.2.  Changes from draft-04, 2009.12.20

   Re-replaced entity reference for current version number in artwork,
   where it doesn't appear to work (xml2rfc bug?). Updated to latest
   IETF Trust Legal Provisions 200902. (jak)

   Re-wording Tag File Format section.

   Adding new section for Other Tag Files.

   Minor clarification on the Fetch File description.

   Synchronized the language between the Payload Manifest and the Tag
   Manifest sections.

   Minor grammatical corrections and clarifications to the Payload
   Manifest section.

   Re-worded and re-ordered payload section and structure intro. Except
   for the base directory naming, the structure intro is strictly
   explanatory.

   Replaced current version number with entity reference.

   Move checksum algorithm information into its own section.

   Major re-wording of section on validity and completeness to provide
   explicit, enumerated definitions for "valid", "complete", and
   "incomplete" bags.
   Added explicit wording about byte order marks (BOM) in UTF-8.

   Re-named section titles for better clarity.

   Re-wording security consideration on checksum purposes to more
   accurately reflect the real purposes of the checksums.

   Major restructuring of the document for brevity and precision.

   Added RFC 2119 language.

   Added terminology section.

   Cleaning up example artwork so that parentheses are more
   consistently used.

   Explicitly stated version number required for conforming to the
   current version of the specification.

   Various minor tweaks to grammar and wording.

A.3.  Changes from draft-03, 2009.04.11

   Re-worded interoperability statement in the Introduction. (Justin)

   Added statements regarding no limitations on various paths, URI, and
   other lengths.

   Clarified that the bag directory may not contain any other
   directories except for the "data" directory.

   A sole carriage return character is now explicitly allowed as a
   valid line separator.

   Tag file encoding is now required to be as stated in the
   "bagit.txt". The "bagit.txt" file is explicitly required to be
   in UTF-8.

   Wording cleanup, clarifying payload file manifests and tag file
   manifests.

   Tags in "bag-info.txt" no longer have any ordering requirement.

   Tag formatting now explicitly states where significant whitespace
   begins in the tag.

   After some consideration, added some security considerations.

   Made it clear that a bag may contain other bags, re: serialization.

   Re-worded interoperability concerns to require creators to be
   spec-compliant, and readers to be tolerant of known potential issues.

   Specified that the FILENAME element in "fetch.txt" is relative to
   the bag root, and that leading slashes are to be treated as relative.

   Updated acknowledgements.

   Various other minor edits for clarity and readability.

A.4.  Changes from draft-02, 2008.07.11

   Added language to require the slash ('/') as path separator,
   regardless of the platform where the bag was created. Added an extra
   co-author and an Acknowledgements section.

   Deleted the unnecessary "(optional)" from four of the metadata
   elements, since all metadata elements are optional. Softened the
   equivalence of the serialization name and name of the contained bag
   base directory. Replaced the reference to RFC2822 with an inline
   description of the simpler bag-info.txt format.

   Changed to a variable linear whitespace separator in the description
   of manifest layout and in manifest examples. Added two paragraphs
   under a new "Checksum tools" subsection of the Interoperability
   section to describe some of the peculiarities of dealing with the
   widely used GNU Coreutils checksum tools.

   With the new version, 0.96, there is an important and incompatible
   change of file name (package-info.txt -> bag-info.txt), metadata
   element names (Package-Size -> Bag-Size, Packing-Date ->
   Bagging-Date), and descriptive language to replace the noun "package"
   with "bag" throughout the spec. This was to reduce unnecessary
   synonymy and free up the noun "package" to name the physical
   container (e.g., a mailing carton) used to transfer hard disks.

   In section 7, another important change is the introduction of the
   Payload-Oxum ("octetstream sum") metadata element to convey precise,
   machine-readable payload size information for capacity planning
   (especially useful when preparing to receive files listed in
   fetch.txt). The Bag-Size definition was adjusted to steer it more
   towards human consumption.

   In section 2.2 the spec now requires exactly two spaces between
   checksum and filename in manifests. This results from the experience
   that as of 2008, not all widely available validation tools are
   flexible in the kind of separating whitespace recognized. The
   examples have been updated to use the two-space form as well.

   Comment added that while overall bag size is unlimited, practical
   limitations on the amount of data that a receiver can stage may
   warrant splitting a large bag into several smaller bags.

   Added a reference to the SWORD protocol.

   Minor edits for scanning and reformatting to cut down line length for
   some figures that exceeded 72 chars (limit for Internet-Drafts).

A.5.  Changes from draft-01, 2008.05.30

   Added mention of preserving empty directories.

   Simplified function of "tag checksum file" to "tag manifest", having
   same format as payload manifest. The tag manifest is optional and
   need not include every tag file.

   Loosened interpretation of payload manifest to "union" concept: every
   payload file must be listed in at least one manifest but need not be
   listed in every manifest.

   Shortened the Introduction's first paragraph to be less duplicative
   of text in the Abstract.

   Changed Delivery-Date to Packing-Date.

   Correctly sorted the author list and clarified deserialization
   wording.

A.6.  Changes from draft-00, 2008.03.24

   Author address corrections and miscellaneous stylistic edits.

   Added some mention of physical media-based transfers, preferred
   characteristics of transfer filesystems, and network transfer issues.

   Added basic bag example early and changed the narrative to more
   clearly delineate component files.

   Wording changes under fetch.txt, and note that fetch.txt will need to
   be modified before bag return.

   Fixed checksum encoding reference to base64 rather than hex.
   (B. Vargas)

   Described simple normalization approach for checksum algorithm names.
   (B. Vargas)

   In the example bag, added the ARC files found in the fetch.txt to
   the manifest as well. (A. Turoff)

Authors' Addresses

   Andy Boyko
   1438 Kingfisher Way
   Sunnyvale, CA 94087
   USA

   Email: andy@boyko.net

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   US

   Email: jak@ucop.edu

   Justin Littman
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: jlit@loc.gov

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC 20540
   USA

   Email: emad@loc.gov

   Brian Vargas
   1354 Quincy St. NW
   Washington, DC 20011
   USA

   Email: brian@ardvaark.net