idnits 2.17.1
draft-kunze-bagit-16.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust and authors Copyright Line does not
match the current year
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords -- however, there's a paragraph with
a matching beginning. Boilerplate error?
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (June 4, 2018) is 2153 days in the past. Is this
intentional?
Checking references for intended status: Informational
----------------------------------------------------------------------------
-- Obsolete informational reference (is this intentional?): RFC 4234
(Obsoleted by RFC 5234)
Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group J. Kunze
3 Internet-Draft California Digital Library
4 Intended status: Informational J. Littman
5 Expires: December 6, 2018 George Washington University Libraries
6 E. Madden
7 J. Scancella
8 C. Adams
9 Library of Congress
10 June 4, 2018
12 The BagIt File Packaging Format (V1.0)
13 draft-kunze-bagit-16
15 Abstract
17 This document describes BagIt, a set of hierarchical file layout
18 conventions for storage and transfer of arbitrary digital content. A
19 "bag" has just enough structure to enclose descriptive metadata
20 "tags" and a file "payload" but does not require knowledge of the
21 payload's internal semantics. This BagIt format is suitable for
22 reliable storage and transfer.
24 Status of This Memo
26 This Internet-Draft is submitted in full conformance with the
27 provisions of BCP 78 and BCP 79.
29 Internet-Drafts are working documents of the Internet Engineering
30 Task Force (IETF). Note that other groups may also distribute
31 working documents as Internet-Drafts. The list of current Internet-
32 Drafts is at https://datatracker.ietf.org/drafts/current/.
34 Internet-Drafts are draft documents valid for a maximum of six months
35 and may be updated, replaced, or obsoleted by other documents at any
36 time. It is inappropriate to use Internet-Drafts as reference
37 material or to cite them other than as "work in progress."
39 This Internet-Draft will expire on December 6, 2018.
41 Copyright Notice
43 Copyright (c) 2018 IETF Trust and the persons identified as the
44 document authors. All rights reserved.
46 This document is subject to BCP 78 and the IETF Trust's Legal
47 Provisions Relating to IETF Documents
48 (https://trustee.ietf.org/license-info) in effect on the date of
49 publication of this document. Please review these documents
50 carefully, as they describe your rights and restrictions with respect
51 to this document. Code Components extracted from this document must
52 include Simplified BSD License text as described in Section 4.e of
53 the Trust Legal Provisions and are provided without warranty as
54 described in the Simplified BSD License.
56 Table of Contents
58 1. Introduction
59 1.1. Purpose
60 1.2. Requirements
61 1.3. Terminology
62 2. Structure
63 2.1. Required Elements
64 2.1.1. Bag Declaration: bagit.txt
65 2.1.2. Payload Directory: data/
66 2.1.3. Payload Manifest: manifest-algorithm.txt
67 2.2. Optional Elements
68 2.2.1. Tag Manifest: tagmanifest-algorithm.txt
69 2.2.2. Bag Metadata: bag-info.txt
70 2.2.3. Fetch File: fetch.txt
71 2.2.4. Other Tag Files
72 2.3. Text Tag File Format
73 2.4. Bag Checksum Algorithms
74 3. Complete and Valid bags
75 4. Examples
76 4.1. Example of a basic bag
77 4.2. Example bag using fetch.txt
78 5. Security Considerations
79 5.1. Special directory characters
80 5.2. Control of URLs in fetch.txt
81 5.3. File sizes in fetch.txt
82 6. Practical Considerations (non-normative)
83 6.1. Interoperability
84 6.1.1. Filename normalization
85 6.1.2. Windows and Unix file naming
86 6.1.3. Legacy checksum tools
87 7. Augmented Backus-Naur Form (non-normative)
88 7.1. Bag Declaration: bagit.txt
89 7.2. Payload Manifest: manifest-algorithm.txt
90 7.3. Bag Metadata: bag-info.txt
91 7.4. Fetch File: fetch.txt
92 8. Contributors
93 9. Acknowledgements
94 10. IANA Considerations
95 11. References
96 11.1. Normative References
97 11.2. Informative References
98 Authors' Addresses
100 1. Introduction
102 1.1. Purpose
104 BagIt is a set of hierarchical file layout conventions designed to
105 support storage and transfer of arbitrary digital content. A bag
106 consists of a directory containing the payload files and other
107 accompanying metadata files known as "tag" files. The "tags" are
108 metadata files intended to facilitate and document the storage and
109 transfer of the bag. Processing a bag does not require any
110 understanding of the payload file contents and the payload files can
111 be accessed without processing the BagIt metadata.
113 The name, BagIt, is inspired by the "enclose and deposit" method
114 [ENCDEP], sometimes referred to as "bag it and tag it". BagIt
115 differs from serialized archive formats such as MIME, TAR, or ZIP in
116 two general areas:
118 1. Strong integrity assurances. The format supports cryptographic-
119 quality hash algorithms (see Section 2.4) and allows for in-place
120 upgrades to add additional manifests using stronger algorithms
121 without breaking backwards compatibility.
123 2. Direct file access. Because BagIt specifies an actual filesystem
124 hierarchy rather than a serialized representation of one, files
125 can be accessed using standard operating system utilities,
126 implementations do not need to process a potentially large
127 archive file to extract a subset of data, and the format imposes
128 no size limits for either individual files or a bag.
130 BagIt is widely used for preserving digital assets originating from
131 different domains. Organizations involved in digital preservation
132 with BagIt include the Library of Congress, Dryad Data Repository,
133 NSF DataONE, and the Rockefeller Archive Center. Software
134 implementations are available for many languages including Python,
135 Ruby, Java, Perl, and PHP. BagIt is also used in the libraries of
136 many universities, such as Cornell, Purdue, Stanford, Ghent University,
137 New York University, and the University of California.
139 1.2. Requirements
141 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
142 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
143 "OPTIONAL" in this document are to be interpreted as described in BCP
144 14 [RFC2119] [RFC8174] when, and only when, they appear in all
145 capitals as shown here.
147 Implementers are strongly encouraged to review the interoperability
148 considerations described in Section 6.1.
150 1.3. Terminology
152 The following terms have precise definitions as used in this
153 document:
155 bag A set of opaque files contained within the structure defined by
156 this document.
158 bag declaration The file required to be in all bags conforming to
159 this document. Contains values necessary to process the rest of a
160 bag. See Section 2.1.1.
162 bag checksum algorithm The name of a cryptographic checksum
163 algorithm which has been normalized for use in a manifest or tag
164 manifest file name (e.g. "sha512") as described in Section 2.4.
166 manifest A tag file that maps filepaths to checksums. A manifest
167 can be a payload manifest (Section 2.1.3) or a tag manifest
168 (Section 2.2.1).
170 payload The data encapsulated by the bag as a set of named files,
171 which may be organized in sub-directories. The contents of the
172 payload files are opaque to this document, and, with respect to
173 BagIt processing, are always considered as sequences of
174 uninterpreted octets. See Section 2.1.2.
176 tag directory A directory that contains one or more tag files.
178 tag file A file which contains metadata about the bag or its
179 payload. This document defines the standard BagIt tag files: the
180 bag declaration in "bagit.txt" (Section 2.1.1), payload manifests
181 (Section 2.1.3), tag manifests (Section 2.2.1), bag metadata in
182 "bag-info.txt" (Section 2.2.2), and remote payload in "fetch.txt"
183 (Section 2.2.3). This document also allows other arbitrary tag
184 files as described in Section 2.2.4.
186 complete A bag which contains every element required by this
187 document, every payload file listed in a manifest, and any
188 optional files which are listed in a tag manifest. See Section 3.
190 valid A complete bag where every checksum in every manifest has been
191 successfully verified against the corresponding file.
193 2. Structure
195 A bag MUST consist of a base directory containing:
197 1. a set of required and optional tag files (Section 2.2)
199 2. a sub-directory named "data", called the payload directory
200 (Section 2.1.2)
202 3. a set of optional tag directories
204 The tag files in the base directory consist of one or more files
205 named "manifest-_algorithm_.txt" (see Section 2.1.3 and Section 2.4),
206 a file named "bagit.txt" (see Section 2.1.1), and zero or more
207 additional tag files (see Section 2.2). The tag files and
208 directories are in arbitrary file hierarchies and MAY have any name
209 that is not reserved for a file or directory in this document.
211 The base directory can have any name.
213 <base directory>/
214 |
215 +-- bagit.txt
216 |
217 +-- manifest-<algorithm>.txt
218 |
219 +-- [additional tag files]
220 |
221 +-- data/
222 |   |
223 |   +-- [payload files]
224 |
225 +-- [tag directories]/
226     |
227     +-- [tag files]
229 2.1. Required Elements
231 2.1.1. Bag Declaration: bagit.txt
233 The "bagit.txt" tag file MUST consist of exactly two lines in this
234 order:
236 BagIt-Version: M.N
237 Tag-File-Character-Encoding: ENCODING
239 _M.N_ identifies the BagIt major (M) and minor (N) version numbers.
240 _ENCODING_ identifies the character set encoding used by the
241 remaining tag files. _ENCODING_ SHOULD be "UTF-8" but for backwards
242 compatibility it MAY be any other encoding registered in [RFC2978].
243 The bag declaration itself MUST be encoded in UTF-8, and MUST NOT
244 contain a byte-order mark (BOM) [RFC3629].
246 The number for this version of BagIt is "1.0".
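The two-line declaration described above lends itself to a very small parser. The following Python sketch is illustrative only: the function name "parse_bag_declaration" and its error handling are assumptions of this example, not part of BagIt. It rejects a leading UTF-8 byte-order mark, requires exactly two lines, and returns the version numbers and the declared encoding.

```python
import re

# Hypothetical helper for illustration; not defined by the BagIt document.
_VERSION_RE = re.compile(r"BagIt-Version: (\d+)\.(\d+)")
_ENCODING_RE = re.compile(r"Tag-File-Character-Encoding: (\S+)")


def parse_bag_declaration(raw):
    """Return (major, minor, encoding) from the raw bytes of bagit.txt."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("bagit.txt must not begin with a byte-order mark")
    # The declaration itself is always UTF-8, whatever encoding it declares.
    lines = re.split(r"\r\n|\r|\n", raw.decode("utf-8"))
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line terminator
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")
    version = _VERSION_RE.fullmatch(lines[0])
    encoding = _ENCODING_RE.fullmatch(lines[1])
    if version is None or encoding is None:
        raise ValueError("unexpected bag declaration content")
    return int(version.group(1)), int(version.group(2)), encoding.group(1)
```

A caller would typically read "bagit.txt" in binary mode, pass the bytes to this function, and then use the returned encoding to decode the remaining tag files.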
248 2.1.2. Payload Directory: data/
250 The base directory MUST contain a sub-directory named "data".
252 The payload directory contains the arbitrary digital content within
253 the bag. The files under the payload directory are called payload
254 files, or the payload. Each payload file is treated as an opaque
255 octet stream when verifying file correctness. Payload files MAY be
256 organized in arbitrary sub-directory structures within the payload
257 directory; however, for the purpose of this document, such
258 sub-directory structures and filenames have no given meaning.
260 2.1.3. Payload Manifest: manifest-algorithm.txt
262 A payload manifest file provides a complete listing of each payload
263 file name along with a corresponding checksum to permit data
264 integrity checking. A bag can have more than one payload manifest,
265 with each using a different checksum algorithm. Manifest entries
266 MUST satisfy the following constraints:
268 o Every bag MUST contain at least one payload manifest file and MAY
269 contain more than one.
271 o Every payload manifest MUST list every payload file name exactly
272 once.
274 o A payload manifest file MUST have a name of the form "manifest-
275 _algorithm_.txt", where _algorithm_ is a string specifying the
276 checksum algorithm used by that manifest as described in
277 Section 2.4.
279 Example payload manifest filenames
281 manifest-sha256.txt
282 manifest-sha512.txt
283 Each line of a payload manifest file MUST be of the form:
285 checksum filepath
287 where _filepath_ is the pathname of a file relative to the base
288 directory, and _checksum_ is a hex-encoded checksum calculated
289 according to _algorithm_ over every octet in the file.
291 o The hex-encoded checksum MAY use uppercase and/or lowercase
292 letters.
294 o The slash character ('/') MUST be used as a path separator in
295 _filepath_.
297 o One or more linear whitespace characters (spaces or tabs) MUST
298 separate _checksum_ from _filepath_.
300 o There is no limitation on the length of a pathname.
302 o The payload manifest MUST NOT reference files outside the payload
303 directory.
305 o If a _filepath_ includes a line feed (LF), a carriage return (CR),
306 carriage return plus line feed (CRLF) or percent sign (%), those
307 characters (and only those) MUST be percent-encoded following
308 [RFC3986].
310 A manifest MUST NOT reference directories. Bag creators who wish to
311 create an otherwise empty directory have typically done so by
312 creating an empty placeholder file with a name such as ".keep".
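The manifest line rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names "decode_path" and "parse_manifest" are hypothetical, the input is text that has already been decoded using the encoding declared in "bagit.txt", and duplicate entries are treated as an error because every payload file name is to be listed exactly once.

```python
import re

# Hypothetical helpers for illustration; not defined by the BagIt document.
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}
_LINE_RE = re.compile(r"([0-9A-Fa-f]+)[ \t]+(.+)")


def decode_path(encoded):
    """Undo percent-encoding for LF, CR and '%' only, in a single pass."""
    return re.sub(
        r"%(?:0A|0D|25)",
        lambda m: _ESCAPES[m.group(0).upper()],
        encoded,
        flags=re.IGNORECASE,
    )


def parse_manifest(text):
    """Map each filepath in a manifest to its lowercased hex checksum."""
    entries = {}
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue  # empty piece after the final line terminator
        match = _LINE_RE.fullmatch(line)
        if match is None:
            raise ValueError("malformed manifest line: %r" % line)
        checksum, encoded_path = match.groups()
        filepath = decode_path(encoded_path)
        if filepath in entries:
            raise ValueError("file listed more than once: %r" % filepath)
        # Hex digits may be upper or lower case, so normalize for comparison.
        entries[filepath] = checksum.lower()
    return entries
```

Lowercasing the checksum is a design choice of this sketch so that later comparisons against freshly computed digests do not depend on letter case.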
314 2.2. Optional Elements
316 2.2.1. Tag Manifest: tagmanifest-algorithm.txt
318 A tag manifest is a tag file that lists other tag files and checksums
319 for those tag files generated using a particular bag checksum
320 algorithm.
322 A bag MAY contain one or more tag manifests, in which case each tag
323 manifest SHOULD list the same set of tag files.
325 Each tag manifest MUST list every payload manifest. Each tag
326 manifest MUST NOT list any tag manifests, but SHOULD list the
327 remaining tag files present in the bag.
329 A tag manifest file MUST have a name of the form "tagmanifest-
330 _algorithm_.txt", where _algorithm_ is a string following the format
331 described in Section 2.4 specifying the bag checksum algorithm used
332 in that manifest.
334 Tag manifests SHOULD use the same algorithms as the payload manifests
335 that are present in the bag.
337 Example tag manifest filenames:
339 tagmanifest-sha256.txt
340 tagmanifest-sha512.txt
342 A tag manifest file has the same form as the payload manifest file
343 described in Section 2.1.3, but MUST NOT list any payload files.
344 As a result, no _filepath_ listed in a tag manifest begins with "data/".
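The listing rules for tag manifests in this section can be checked mechanically. The sketch below is one possible approach, not a prescribed one: the function name "check_tag_manifest_paths" and the wording of the reported problems are assumptions of this example. It takes the filepaths listed in one tag manifest together with the names of the payload manifests present in the bag.

```python
import re


def check_tag_manifest_paths(listed_paths, payload_manifest_names):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration; not defined by the BagIt document.
    """
    problems = []
    listed = set(listed_paths)
    for path in sorted(listed):
        if path.startswith("data/"):
            # Tag manifests must not list payload files.
            problems.append("payload file listed: " + path)
        if re.fullmatch(r"tagmanifest-[a-z0-9]+\.txt", path):
            # Tag manifests must not list tag manifests.
            problems.append("tag manifest listed: " + path)
    for name in sorted(payload_manifest_names):
        if name not in listed:
            # Every payload manifest must be listed.
            problems.append("payload manifest not listed: " + name)
    return problems
```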
346 2.2.2. Bag Metadata: bag-info.txt
348 The "bag-info.txt" file is a tag file that contains metadata elements
349 describing the bag and the payload. The metadata elements contained
350 in the "bag-info.txt" file are intended primarily for human use. All
351 metadata elements are OPTIONAL and MAY be repeated. Because "bag-
352 info.txt" is intended for human reading and editing, ordering MAY be
353 significant and the ordering of metadata elements MUST be preserved.
355 A metadata element MUST consist of a label, a colon ":", a single
356 linear whitespace character (space or tab), and a value, terminated
357 with a line feed (LF), carriage return (CR) or carriage return plus
358 line feed (CRLF).
360 The label MUST NOT contain colon (:), line feeds (LF) or carriage
361 returns (CR). The label MAY contain linear whitespace characters,
362 but MUST NOT start or end with whitespace.
364 It is RECOMMENDED that lines not exceed 79 characters in length.
365 Long values MAY be continued onto the next line by inserting a line
366 feed (LF), a carriage return (CR), or carriage return plus line feed
367 (CRLF) and indenting the next line with one or more linear whitespace
368 characters (spaces or tabs). Except for linebreaks, such padding does
369 not form part of the value.
371 Implementations wishing to support previous BagIt versions MUST
372 accept multiple linear whitespace before and after the colon when the
373 bag version is earlier than 1.0; such whitespace does not form part
374 of the label or value.
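As an illustration of the element syntax described above, the following Python sketch parses "bag-info.txt" text for a version 1.0 bag into an ordered list of label and value pairs. The function name "parse_bag_info" is hypothetical. Two simplifications are assumptions of this sketch: a continuation line is joined to the preceding value with a single space, and the more lenient whitespace handling for bags earlier than 1.0 is not implemented.

```python
import re


def parse_bag_info(text):
    """Return an ordered list of (label, value) pairs from bag-info.txt text.

    Hypothetical helper for illustration; not defined by the BagIt document.
    """
    elements = []
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue
        if line[0] in " \t":
            # Continuation line: the indentation is padding, not value text.
            if not elements:
                raise ValueError("continuation line before any element")
            label, value = elements[-1]
            elements[-1] = (label, value + " " + line.strip(" \t"))
            continue
        label, colon, rest = line.partition(":")
        if not colon or not label or label != label.strip(" \t"):
            raise ValueError("malformed label in line: %r" % line)
        if rest[:1] not in (" ", "\t"):
            raise ValueError("colon must be followed by a space or tab")
        elements.append((label, rest[1:]))
    return elements
```

A list of pairs is used rather than a dictionary so that repeated elements and the original ordering are both preserved.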
376 The following are reserved metadata elements. The use of these
377 reserved metadata elements is OPTIONAL but encouraged. Reserved
378 metadata element names are case-insensitive. Except where indicated
379 otherwise, these metadata element names MAY be repeated to capture
380 multiple values.
382 Source-Organization Organization transferring the content.
384 Organization-Address Mailing address of the source organization.
386 Contact-Name Person at the source organization who is responsible
387 for the content transfer.
389 Contact-Phone International format telephone number of person or
390 position responsible.
392 Contact-Email Fully qualified email address of person or position
393 responsible.
395 External-Description A brief explanation of the contents and
396 provenance.
398 Bagging-Date Date (YYYY-MM-DD) that the content was prepared for
399 transfer. This metadata element SHOULD NOT be repeated.
401 External-Identifier A sender-supplied identifier for the bag.
403 Bag-Size Size or approximate size of the bag being transferred,
404 followed by an abbreviation such as MB (megabytes), GB, or TB; for
405 example, 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum
406 (described next), Bag-Size is intended for human consumption.
407 This metadata element SHOULD NOT be repeated.
409 Payload-Oxum The "octetstream sum" of the payload, intended for the
410 purpose of quickly detecting incomplete bags before performing
411 checksum validation. This is strictly an optimization and
412 implementations MUST perform the standard checksum validation
413 process before proclaiming a bag to be valid. This element MUST
414 NOT be present more than once and, if present, MUST be in the form
415 "_OctetCount_._StreamCount_", where _OctetCount_ is the total
416 number of octets (8-bit bytes) across all payload file content and
417 _StreamCount_ is the total number of payload files. This metadata
418 element MUST NOT be repeated.
420 Bag-Group-Identifier A sender-supplied identifier for the set, if
421 any, of bags to which it logically belongs. This identifier
422 SHOULD be unique across the sender's content, and if recognizable
423 as belonging to a globally unique scheme, the receiver SHOULD make
424 an effort to honor reference to it. This metadata element SHOULD
425 NOT be repeated.
427 Bag-Count Two numbers separated by "of", in particular, "N of T",
428 where T is the total number of bags in a group of bags and N is
429 the ordinal number within the group; if T is not known, specify it
430 as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
431 145. This metadata element SHOULD NOT be repeated. If this
432 metadata element is present, it is RECOMMENDED to also include the
433 Bag-Group-Identifier element.
435 Internal-Sender-Identifier An alternate sender-specific identifier
436 for the content and/or bag.
438 Internal-Sender-Description A sender-local explanation of the
439 contents and provenance.
441 In addition to these metadata elements, other arbitrary metadata
442 elements MAY also be present.
444 An example "bag-info.txt" file
446 Source-Organization: FOO University
447 Organization-Address: 1 Main St., Cupertino, California, 11111
448 Contact-Name: Jane Doe
449 Contact-Phone: +1 111-111-1111
450 Contact-Email: example@example.com
451 External-Description: Uncompressed greyscale TIFF images from the
452 FOO papers colle...
453 Bagging-Date: 2008-01-15
454 External-Identifier: university_foo_001
455 Payload-Oxum: 279164409832.1198
456 Bag-Group-Identifier: university_foo
457 Bag-Count: 1 of 15
458 Internal-Sender-Identifier: /storage/images/foo
459 Internal-Sender-Description: Uncompressed greyscale TIFFs created
460 from microfilm and are...
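In the example above, the Payload-Oxum value "279164409832.1198" declares 279164409832 octets across 1198 payload files. A minimal Python sketch of how such a value could be computed follows; the function name "payload_oxum" is hypothetical, and the sketch assumes that every regular file found under the "data" directory is a payload file.

```python
import os


def payload_oxum(bag_dir):
    """Return "OctetCount.StreamCount" for the files under <bag_dir>/data.

    Hypothetical helper for illustration; not defined by the BagIt document.
    """
    octets = 0
    streams = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            streams += 1
    return "%d.%d" % (octets, streams)
```

Comparing this string with the declared Payload-Oxum is only a quick completeness check; it does not replace checksum validation.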
462 2.2.3. Fetch File: fetch.txt
464 For reasons of efficiency, a bag MAY be sent with a list of files to
465 be fetched and added to the payload before it can meaningfully be
466 checked for completeness. The fetch file allows a bag to be
467 transmitted with "holes" in it, which can be practical for several
468 reasons. For example, it obviates the need for the sender to stage a
469 large serialized copy of the content while the bag is transferred to
470 the receiver. Also, this method allows a sender to construct a bag
471 from components that are either a subset of logically related
472 components (e.g., the localized logical object could be much larger
473 than what is intended for export) or assembled from logically
474 distributed sources (e.g., the object components for export are not
475 stored locally under one filesystem tree). An OPTIONAL tag file
476 called the fetch file contains such a list.
478 The fetch file MUST be named "fetch.txt". Every file listed in the
479 fetch file MUST be listed in every payload manifest. A fetch file
480 MUST NOT list any tag files.
482 Each line of a fetch file MUST be of the form:
484 url length filepath
486 where _url_ identifies the file to be fetched and MUST be an absolute
487 URI as defined in [RFC3986], _length_ is the number of octets in the
488 file (or "-", to leave it unspecified), and _filepath_ identifies the
489 corresponding payload file, relative to the base directory.
491 The slash character ('/') MUST be used as a path separator in
492 _filepath_. One or more linear whitespace characters (spaces or tabs)
493 MUST separate these three values, and any such characters in the
494 _url_ MUST be percent-encoded [RFC3986]. If _filepath_ includes a
495 line feed (LF), a carriage return (CR), carriage return plus line
496 feed (CRLF) or percent sign (%), those characters (and only those)
497 MUST be percent-encoded following [RFC3986]. There is no limitation
498 on the length of any of the fields in the fetch file.
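The three-field line format above can be parsed along the following lines. This Python sketch is illustrative: the function name "parse_fetch_line" is hypothetical, the test for an absolute URI is reduced to checking that a scheme is present, and the check that the filepath begins with "data/" is this sketch's way of enforcing that a fetch file lists no tag files.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper for illustration; not defined by the BagIt document.
_FETCH_RE = re.compile(r"(\S+)[ \t]+(-|[0-9]+)[ \t]+(.+)")
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}


def parse_fetch_line(line):
    """Return (url, length or None, filepath) for one line of fetch.txt."""
    match = _FETCH_RE.fullmatch(line)
    if match is None:
        raise ValueError("malformed fetch line: %r" % line)
    url, length, encoded_path = match.groups()
    if not urlsplit(url).scheme:
        raise ValueError("url is not an absolute URI: %r" % url)
    # Only LF, CR and '%' are percent-encoded in the filepath field.
    filepath = re.sub(
        r"%(?:0A|0D|25)",
        lambda m: _ESCAPES[m.group(0).upper()],
        encoded_path,
        flags=re.IGNORECASE,
    )
    if not filepath.startswith("data/"):
        raise ValueError("fetch entries must name payload files")
    return url, (None if length == "-" else int(length)), filepath
```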
500 2.2.4. Other Tag Files
502 A bag MAY contain other tag files that are not defined by this
503 document. Implementations MUST perform standard checksum validation
504 on any tag file which is listed in a tag manifest but MUST otherwise
505 ignore their contents.
507 2.3. Text Tag File Format
509 All tag files specifically described in this document MUST adhere to
510 the text tag file format described below. Other tag files MAY adhere
511 to the text tag file format described below.
513 Text tag files are line-oriented, and each line MUST be terminated by
514 a line feed (LF), a carriage return (CR), or carriage return plus
515 line feed (CRLF). It is RECOMMENDED that the last line in a tag file
516 also end with LF, CR, or CRLF. Text tag file names MUST end in the
517 extension ".txt".
519 In all text tag files except for the bag declaration file, text MUST
520 use the character encoding specified in the "bagit.txt" bag
521 declaration file. Text tag files except for the bag declaration file
522 MAY include a byte-order mark (BOM) only if the specified encoding
523 requires it for proper decoding. In accordance with [RFC3629], when
524 "bagit.txt" specifies UTF-8, the tag files MUST NOT begin with a byte-
525 order mark (BOM). See Section 2.1.1.
527 The use of UTF-8 for text tag files is strongly RECOMMENDED. A
528 future version of BagIt may disallow encodings other than UTF-8.
530 2.4. Bag Checksum Algorithms
532 The payload manifests and tag manifests permit validating the
533 integrity of a bag's payload and tag files using checksums produced
534 by bag checksum algorithms. Checksum values MUST be encoded so as to
535 conform to the manifest format specified in Section 2.1.3. However,
536 the internal details of a checksum are outside the scope of this
537 document.
539 To avoid future ambiguity, the checksum algorithm SHOULD be
540 registered in IANA's "Named Information Hash Algorithm Registry"
541 [ni-registry] according to [RFC6920], but MAY for backwards
542 compatibility also be MD5 [RFC1321] or SHA-1 [RFC3174].
544 The name of the checksum algorithm MUST be normalized for use in the
545 manifest's filename by lowercasing the common name of the algorithm
546 and removing all non-alphanumeric characters. Following is a partial
547 list mapping common algorithm names to normalized names:
549 o MD5: md5
551 o SHA-1: sha1
553 o SHA-256: sha256
555 o SHA-512: sha512
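The normalization rule above amounts to lowercasing and filtering. A small Python sketch follows; the function names are hypothetical, and "alphanumeric" is taken here to mean the ASCII letters and digits, which is an assumption of this example.

```python
import re


def normalize_algorithm_name(common_name):
    """Lowercase a common algorithm name and drop non-alphanumerics.

    Hypothetical helper for illustration; not defined by the BagIt document.
    """
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def payload_manifest_filename(common_name):
    """Build the payload manifest file name for the given algorithm."""
    return "manifest-" + normalize_algorithm_name(common_name) + ".txt"
```

For instance, "SHA-512" normalizes to "sha512", which yields the payload manifest name "manifest-sha512.txt".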
557 Starting with BagIt 1.0, bag creation and validation tools MUST
558 support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
559 enable SHA-512 by default when creating new bags. For backwards
560 compatibility implementers SHOULD support MD5 [RFC1321] and SHA-1
561 [RFC3174]. Implementers are encouraged to simplify the process of
562 adding additional manifests using new algorithms to streamline the
563 process of in-place upgrades.
565 3. Complete and Valid bags
567 A _complete_ bag MUST meet the following requirements:
569 1. Every required element MUST be present (Section 2.1).
571 2. Every file listed in every tag manifest MUST be present.
573 3. Every file listed in every payload manifest MUST be present.
575 4. For BagIt 1.0, every payload file MUST be listed in every payload
576 manifest. Note that older versions of BagIt allowed payload
577 files to be listed in just one of the manifests.
579 5. Every element present MUST conform to BagIt 1.0.
581 A _valid_ bag MUST meet the following requirements:
583 1. The bag MUST be _complete_.
585 2. Every checksum in every payload manifest and tag manifest has
586 been successfully verified against the contents of the
587 corresponding file.
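The payload-related parts of the requirements above can be sketched in Python as shown here. The sketch is deliberately partial and rests on several assumptions: the names "file_checksum" and "check_payload" are hypothetical, the manifest has already been parsed into a mapping from filepath to hex checksum, the normalized algorithm name is one that Python's hashlib accepts (for example "sha256" or "sha512"), and every filepath has already been vetted so that it cannot escape the bag directory. Tag manifests and the other required elements are not checked.

```python
import hashlib
import os


def file_checksum(path, algorithm):
    """Return the lowercase hex digest of a file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_payload(bag_dir, algorithm, entries):
    """Return (missing, unlisted, mismatched) lists of filepaths.

    Hypothetical helper for illustration. 'entries' maps filepaths relative
    to bag_dir (using '/' separators) to hex checksums from one manifest.
    """
    on_disk = set()
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            on_disk.add(rel.replace(os.sep, "/"))
    listed = set(entries)
    missing = sorted(listed - on_disk)    # listed but absent: not complete
    unlisted = sorted(on_disk - listed)   # present but unlisted: not complete
    mismatched = sorted(
        path
        for path in listed & on_disk
        if file_checksum(os.path.join(bag_dir, *path.split("/")), algorithm)
        != entries[path].lower()
    )                                     # checksum failure: not valid
    return missing, unlisted, mismatched
```

With respect to one payload manifest, empty "missing" and "unlisted" lists correspond to the completeness requirements for payload files, and an additionally empty "mismatched" list corresponds to the validity requirement.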
589 4. Examples
591 4.1. Example of a basic bag
593 This is the layout of a basic bag containing an image and a companion
594 OCR file. Lines of file content are shown with added parentheses to
595 indicate each complete line. For brevity this example uses MD5
596 rather than the recommended SHA-512.
598 myfirstbag/
599 |
600 | manifest-md5.txt
601 | (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
602 | (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
603 |
604 | bagit.txt
605 | (BagIt-Version: 1.0 )
606 | (Tag-File-Character-Encoding: UTF-8 )
607 |
608 \--- data/
609 |
610 | 27613-h/images/q172.png
611 | (... image bytes ... )
612 |
613 | 27613-h/images/q172.txt
614 | (... OCR text ... )
615 ....
617 4.2. Example bag using fetch.txt
619 This is the layout of a bag which expects the receiver to download
620 the files listed in the payload manifests prior to validation. Lines
621 of file content are shown with added parentheses to indicate each
622 complete line. For brevity this example uses MD5 rather than the
623 recommended SHA-512.
625 highsmith-tahoe/
626 |
627 | manifest-md5.txt
628 | (102b0e6effe208ef9b29864946de9e22 data/23364a.tif )
629 |
630 | fetch.txt
631 | (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
632 | 216951362 data/23364a.tif )
633 |
634 | bagit.txt
635 | (BagIt-Version: 1.0 )
636 | (Tag-File-Character-Encoding: UTF-8 )
637 |
638 | bag-info.txt
639 | (Internal-Sender-Description: Download link found at )
640 | ( https://www.loc.gov/resource/highsm.23364/ )
642 5. Security Considerations
644 5.1. Special directory characters
646 The paths specified in the payload manifests, tag manifests, and
647 fetch files do not prohibit special directory characters which have
648 special meaning on some operating systems. Implementers MUST ensure
649 that files outside the bag directory structure are not accessed when
650 reading or writing files based on paths specified in a bag.
652 All implementations SHOULD have a test suite to guard against special
653 directory characters.
655 For example, a maliciously crafted "tagmanifest-sha512.txt" file
656 might contain entries which begin with a path character such as "/",
657 "..", or a "~username" home directory reference in an attempt to
658 cause a naive implementation to leak or overwrite targeted files on a
659 POSIX operating system.
661 Implementers targeting Windows SHOULD test their implementations to ensure
662 that safety-checks prevent use of drive letters and the less commonly
663 used namespace sequences (e.g. "\\?\C:\...") described in [MSFNAM].
665 To assist implementers, the Library of Congress conformance suite
666 [LC-CONFORMANCE-SUITE] has some tests for invalid bags which are
667 expected to fail on POSIX or Windows clients.
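One conservative way to guard against the path patterns discussed in this section is sketched below in Python. The function name "is_safe_bag_path" is hypothetical and the policy is an assumption of this example rather than a requirement of this document: it rejects some paths that are technically permitted, such as any path containing a backslash or a ".." segment, in exchange for a simple rule that behaves the same on POSIX and Windows.

```python
import re


def is_safe_bag_path(filepath):
    """Return True if a manifest or fetch filepath passes a strict check.

    Hypothetical helper for illustration; not defined by the BagIt document.
    """
    if not filepath or "\\" in filepath or "\x00" in filepath:
        # Backslashes cover Windows separators and "\\?\" namespace prefixes.
        return False
    if filepath.startswith(("/", "~")):
        # Absolute paths and "~username" home directory references.
        return False
    if re.match(r"[A-Za-z]:", filepath):
        # Windows drive letters such as "C:".
        return False
    # Reject any parent-directory segment, wherever it appears.
    return ".." not in filepath.split("/")
```

A string check of this kind does not account for symbolic links inside the bag, so an implementation may also wish to resolve the final path and confirm that it remains under the bag's base directory before reading or writing.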
669 5.2. Control of URLs in fetch.txt
671 Implementers of tools that complete bags by retrieving URLs listed in
672 a fetch file need to be aware that some of those URLs might point to
673 hosts, intentionally or unintentionally, that are not under control
674 of the bag's sender. Moreover, older checksum algorithms, even if
675 reasonable for detecting corruption during transit, may not offer
676 strong cryptographic protection against intentional spoofing.
678 5.3. File sizes in fetch.txt
680 The size of files, as optionally reported in the fetch file, cannot
681 be guaranteed to match the actual file size to be downloaded.
682 Implementers SHOULD take steps to monitor and abort transfer when the
683 received file size exceeds the file size reported in the fetch file.
684 Implementers SHOULD NOT use the file size in the fetch file for
685 critical resource allocation, such as buffer sizing or storage
686 requisitioning.
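The monitoring step described above could be sketched as follows. The function name "copy_with_limit" and its parameters are hypothetical; "source" and "sink" are assumed to be binary file-like objects, such as an open HTTP response and a local file. The declared length is used only as an upper bound for aborting the transfer, never for sizing a buffer.

```python
def copy_with_limit(source, sink, declared_length, chunk_size=65536):
    """Copy source to sink, aborting if too many octets are received.

    Hypothetical helper for illustration. 'declared_length' is the length
    from fetch.txt, or None when the fetch file gives "-".
    Returns the number of octets received.
    """
    received = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return received
        received += len(chunk)
        if declared_length is not None and received > declared_length:
            raise ValueError("received more octets than fetch.txt declared")
        sink.write(chunk)
```

When the exception is raised, the caller is expected to discard the partially written file.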
688 6. Practical Considerations (non-normative)
690 6.1. Interoperability
692 This section lists practical considerations for implementers and
693 users. None of the points below are required but they are
694 recommended for general-purpose usage.
696 Upon discovering errors in bags, an implementation is free to take
697 action (for example, logging or reporting) in an application-specific
698 manner. This document does not mandate any particular action.
700 The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is
701 provided as a public resource to test new implementations for
702 compatibility and error handling.
704 6.1.1. Filename normalization
706 This section provides background information on various challenges
707 caused by differences in how operating systems, filesystems, and
708 common tools handle filenames followed by a list of recommendations
709 for implementers in Section 6.1.1.3.
711 6.1.1.1. Case sensitivity
713 There are three challenges for interoperability related to filename
714 case:
716 o Filesystems such as FAT or EXFAT always convert filenames to
717 uppercase: "example.txt" will be stored as "EXAMPLE.TXT"
719 o Many Unix filesystems save filenames exactly as provided, allowing
720 multiple files which differ only in case: "example.txt" and
721 "Example.txt" are separate files
723 o NTFS and Apple's HFS Plus usually preserve case when storing files
724 but are case-insensitive when retrieving them. A file saved as
725 "Example.txt" will be retrieved by that name but will also be
726 retrieved as "EXAMPLE.TXT", "example.txt", etc.
728 6.1.1.2. Unicode normalization
730 The Unicode specification has common cases where different character
731 sequences produce the same human-meaningful text. These are referred
732 to as "canonically equivalent" and the Unicode specification defines
733 different normalization forms -- see [UNICODE-TR15] for the full
734 details and a brief example below:
736 The common surname "Nunez" normalized in different forms
738 Normalization Form D (Decomposition):
740    Char     UTF8 Hex   Name
741    ----------------------------------------------
742    N        4e         LATIN CAPITAL LETTER N
743    u        75         LATIN SMALL LETTER U
744    \u0301   cc81       COMBINING ACUTE ACCENT
745    n        6e         LATIN SMALL LETTER N
746    \u0303   cc83       COMBINING TILDE
747    e        65         LATIN SMALL LETTER E
748    z        7a         LATIN SMALL LETTER Z
750 Normalization Form C (Canonical Composition):
752    Char     UTF8 Hex   Name
753    ----------------------------------------------
754    N        4e         LATIN CAPITAL LETTER N
755    u        c3ba       LATIN SMALL LETTER U WITH ACUTE
756    n        c3b1       LATIN SMALL LETTER N WITH TILDE
757    e        65         LATIN SMALL LETTER E
758    z        7a         LATIN SMALL LETTER Z
759 Unicode normalization is relevant to BagIt implementers because
760 different systems have different standards for normalization:
762 o Apple's HFS Plus filesystem always normalizes filenames to a
763 fully-decomposed form based on the Unicode 2.0 specification (see
764 [TN1150]).
766 o Windows treats filenames as opaque character sequences (see
767 [MSFNAM]) and will store and return the encoded bytes exactly as
768 provided.
770 o Linux and other common Unix systems are generally similar to
771 Windows in storing and returning opaque byte streams but this
772 behaviour is technically filesystem-dependent.
774 o Utilities used for file management, transfer, and archival may
775 ignore this issue, apply an arbitrary normalization form, or allow
776 the user to control how normalization is applied.
778 In practice, this means that the encoded filename stored in a
779 manifest may fail a simple file existence check because the
780 filename's normalization was changed at some point after the manifest
781 was written. This situation is very confusing for users because the
782 filenames are visually indistinguishable and the "missing" file is
783 obviously present in the payload directory.
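The byte sequences in the tables above can be reproduced with Python's standard unicodedata module. This is only an illustration of the two normalization forms; the helper name utf8_hex is hypothetical.

```python
import unicodedata

# The surname from the example, written with precomposed characters
# (U+00FA LATIN SMALL LETTER U WITH ACUTE, U+00F1 LATIN SMALL LETTER N WITH TILDE).
name = "N\u00fa\u00f1ez"

nfc = unicodedata.normalize("NFC", name)  # canonical composition
nfd = unicodedata.normalize("NFD", name)  # canonical decomposition


def utf8_hex(text):
    """Return the UTF-8 encoding of text as a lowercase hex string."""
    return text.encode("utf-8").hex()


print(utf8_hex(nfc))
print(utf8_hex(nfd))
# The two strings render identically but compare unequal until normalized.
print(nfc == nfd, unicodedata.normalize("NFC", nfd) == nfc)
```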
785 6.1.1.3. Recommendations
787 o Implementations SHOULD discourage the creation of bags containing
788 files which differ only in case.
790 o Implementations SHOULD prevent the creation of bags containing
791 files which differ only in normalization form.
793 o BagIt implementations SHOULD tolerate differences in normalization
794 form by comparing the lists of filesystem and manifest names
795 after applying the same normalization form to both.
797 o Implementations SHOULD issue a warning when multiple manifests are
798 present which differ only in case or normalization form.
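One possible way to act on the recommendations above is sketched below. The function names are hypothetical, and the choice of NFC for comparison is an assumption; any single normalization form applied to both sides would serve.

```python
import unicodedata
from collections import defaultdict


def fold(path):
    """Key under which paths differing only in case or normalization collide."""
    decomposed = unicodedata.normalize("NFD", path)
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_collisions(paths):
    """Group paths that differ only in case or normalization form."""
    groups = defaultdict(set)
    for path in paths:
        groups[fold(path)].add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def missing_from_payload(manifest_paths, filesystem_paths, form="NFC"):
    """List manifest entries with no filesystem match after normalizing both."""
    on_disk = {unicodedata.normalize(form, p) for p in filesystem_paths}
    return [
        p for p in manifest_paths
        if unicodedata.normalize(form, p) not in on_disk
    ]


print(find_collisions(["data/example.txt", "data/Example.txt", "data/other.txt"]))
print(missing_from_payload(["data/N\u00fa\u00f1ez.txt"],
                           ["data/Nu\u0301n\u0303ez.txt"]))
```

A bag creator could refuse or warn when find_collisions returns a non-empty list, while a validator could use missing_from_payload in place of a byte-exact existence check.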
800 6.1.2. Windows and Unix file naming
802 As specified above, only the Unix-based path separator ('/') may be
803 used inside filenames listed in BagIt manifest and fetch.txt files.
804 When bags are exchanged between Windows and Unix platforms, the path
805 separator SHOULD be translated as needed. Receivers of bags on
806 physical media SHOULD be prepared for filesystems created under
807 either Windows or Unix. Besides the fundamental difference between
808 path separators ('\' and '/'), generally, Windows filesystems have
809 more limitations than Unix filesystems.
811 Windows path names have a maximum of 255 characters, and none of
812 these characters may be used in a path component:
814 < > : " / | ? *
816 Windows also reserves the following names, with or without a file
817 extension:
819 CON, PRN, AUX, NUL
820 COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
821 LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
823 See [MSFNAM] for more information and possible alternatives.
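A bag creator that wants to warn about names that would be troublesome on Windows could apply a check along these lines. This is a sketch using only the characters and reserved names listed above; the function name is hypothetical, and the case-insensitive comparison of reserved names is an assumption of this example.

```python
# Characters listed above as unusable in a Windows path component.
# "/" never appears in a component here because it is the BagIt separator.
WINDOWS_FORBIDDEN_CHARS = set('<>:"/|?*')

# Reserved device names listed above (reserved with or without an extension).
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_problems(filepath):
    """Return human-readable problems for a '/'-separated BagIt filepath."""
    problems = []
    for component in filepath.split("/"):
        bad = sorted(WINDOWS_FORBIDDEN_CHARS.intersection(component))
        if bad:
            problems.append(f"{component!r} contains {''.join(bad)!r}")
        # Compare the part before the first "." so "nul.txt" is caught too.
        stem = component.split(".", 1)[0].upper()
        if stem in WINDOWS_RESERVED_NAMES:
            problems.append(f"{component!r} uses reserved name {stem}")
    return problems


print(windows_problems("data/aux.log"))
print(windows_problems("data/what?.txt"))
```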
825 6.1.3. Legacy checksum tools
827 Some bags have been manually assembled using checksum utilities such
828 as those contained in the GNU Coreutils package (md5sum, sha1sum,
829 etc.), collectively referred to here as "md5sum". Implementers who
830 desire wide support of legacy content should be aware of some known
831 quirks of these tools:
833 md5sum can be run in "text mode" which causes it to normalize line-
834 endings on some operating systems. On Unix-like systems both modes
835 will usually produce the same results but on systems like Windows
836 they can produce different results based on the file contents. The
837 md5sum output format has two characters between the checksum and the
838 filepath: the first is always a space and the second is an asterisk
839 ("*") for binary mode and a space for text mode.
841 A final note about md5sum-generated manifests is that for a
842 _filepath_ containing a backslash ('\'), the manifest line will have
843 a backslash inserted in front of the _checksum_ and, under Windows,
844 the backslashes inside _filepath_ can be doubled.
846 Implementers MAY wish to accept this format by ignoring a leading
847 asterisk or handling differences in line termination gracefully but,
848 if so, implementations MUST warn the user that the bag in question
849 will fail strict validation. In such cases it is RECOMMENDED that
850 tools provide an easy option to update the bag with valid manifests.
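A tolerant reader for md5sum-style lines might look like the following sketch. It covers only the quirks described above (the mode character, the leading backslash, and doubled backslashes); the function name and the wording of the warnings are hypothetical.

```python
import re

# Optional leading backslash, hex checksum, a space, then the mode
# character (space for text mode, "*" for binary mode), then the filepath.
_MD5SUM_LINE = re.compile(r"(\\?)([0-9A-Fa-f]+) ([ *])(.+)")


def parse_md5sum_line(line):
    """Return (checksum, filepath, warnings) for one md5sum-style line."""
    match = _MD5SUM_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not an md5sum-style line: {line!r}")
    escaped, checksum, mode, filepath = match.groups()
    warnings = []
    if mode == "*":
        warnings.append("binary-mode marker '*' before filepath")
    if escaped:
        warnings.append("leading backslash before checksum")
        # Assumption: undo only the backslash doubling described above.
        filepath = filepath.replace("\\\\", "\\")
    return checksum, filepath, warnings


print(parse_md5sum_line("d41d8cd98f00b204e9800998ecf8427e *data/file.bin\n"))
```

Returning the warnings alongside the parsed values lets the caller tell the user that the bag would fail strict validation, as required above.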
852 7. Augmented Backus-Naur Form (non-normative)
854 The Augmented Backus-Naur Form (ABNF) rules provided below are non-
855 normative. If there is a discrepancy between requirements in the
856 normative sections and the ABNF, the requirements in the normative
857 sections prevail. Some definitions use the core rules (e.g., DIGIT
858 and HEXDIG) as defined in [RFC4234].
860 7.1. Bag Declaration: bagit.txt
862 bagit.txt ABNF rules:
864 bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending
865 "Tag-File-Character-Encoding: " encoding ending
866 encoding = 1*CHAR
867 ending = CR / LF / CRLF
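As an illustration, the bagit.txt rules above can be approximated with a regular expression. This sketch is not a substitute for the normative text: the function name is hypothetical, and it assumes the encoding label contains no CR or LF.

```python
import re

_ENDING = r"(?:\r\n|\r|\n)"  # CRLF is tried first so it counts as one ending
_BAGIT_TXT = re.compile(
    r"BagIt-Version: ([0-9]+)\.([0-9]+)" + _ENDING
    + r"Tag-File-Character-Encoding: ([^\r\n]+)" + _ENDING
)


def parse_bagit_txt(text):
    """Return (major, minor, encoding) from the decoded text of bagit.txt."""
    match = _BAGIT_TXT.fullmatch(text)
    if match is None:
        raise ValueError("bagit.txt does not match the expected two-line form")
    major, minor, encoding = match.groups()
    return int(major), int(minor), encoding


print(parse_bagit_txt("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"))
```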
869 7.2. Payload Manifest: manifest-algorithm.txt
871 Payload Manifest ABNF rules:
873 payload-manifest = 1*payload-manifest-line
874 payload-manifest-line = checksum 1*WSP filepath ending
875 checksum = 1*case-hexdig
876 case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" /
877 "D" / "d" / "E" / "e" / "F" / "f"
878 filepath = "data/"
879 1*( unreserved / pct-encoded / sub-delims )
880 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
881 sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
882 "*" / "+" / "," / ";" / "=" / "/"
883 pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25"
884 ending = CR / LF / CRLF
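The manifest rules above suggest a simple line parser. The sketch below is deliberately more permissive than the ABNF about which characters may appear in the filepath, and its function names are hypothetical. It decodes only the three percent-encoded sequences named in the pct-encoded rule.

```python
import re

_MANIFEST_LINE = re.compile(r"([0-9A-Fa-f]+)[ \t]+(data/.+)")
_PCT = re.compile(r"%(0D|0d|0A|0a|25)")
_DECODED = {"0D": "\r", "0d": "\r", "0A": "\n", "0a": "\n", "25": "%"}


def decode_filepath(encoded):
    """Decode %0D, %0A and %25 in a single pass, leaving other text as-is."""
    return _PCT.sub(lambda m: _DECODED[m.group(1)], encoded)


def parse_manifest_line(line):
    """Return (checksum, decoded filepath) for one payload manifest line."""
    match = _MANIFEST_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a payload manifest line: {line!r}")
    checksum, filepath = match.groups()
    return checksum, decode_filepath(filepath)


print(parse_manifest_line("ABCDEF01 data/a%0Ab%25.txt\n"))
```

Decoding in one pass matters: a name containing the literal text "%0A" is stored as "%250A" and comes back as "%0A" rather than as a line feed.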
886 7.3. Bag Metadata: bag-info.txt
888 bag-info.txt ABNF rules:
890 metadata = 1*metadata-line
891 metadata-line = key ":" WSP value ending *(continuation ending)
892 key = 1*non-reserved
893 value = 1*non-reserved
894 continuation = WSP 1*non-reserved
895 non-reserved = VCHAR / WSP
896 ; any valid character for the specific encoding
897 ; except those that match "ending"
898 ending = CR / LF / CRLF
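A reader for bag-info.txt following the rules above could be sketched as below. Several choices here are assumptions of the example rather than requirements: the key ends at the first colon, continuation lines are joined to the value with a single space, and blank lines are skipped. The function name is hypothetical.

```python
import re


def parse_bag_info(text):
    """Return a list of (key, value) pairs; keys may legitimately repeat."""
    entries = []
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue
        if line[0] in " \t":
            # Continuation line: belongs to the most recent entry.
            if not entries:
                raise ValueError("continuation line before any metadata line")
            key, value = entries[-1]
            entries[-1] = (key, value + " " + line.strip(" \t"))
        else:
            key, sep, value = line.partition(":")
            if not sep or not key:
                raise ValueError(f"not a metadata line: {line!r}")
            entries.append((key, value.strip(" \t")))
    return entries


sample = (
    "Source-Organization: Example Org\n"
    "External-Description: first part\n"
    "  second part\n"
)
print(parse_bag_info(sample))
```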
900 7.4. Fetch File: fetch.txt
902 fetch.txt ABNF rules:
904 fetch = 1*fetch-line
905 fetch-line = url 1*WSP length 1*WSP filepath ending
906 url = <absolute-URI, see [RFC3986], Section 4.3>
907 length = 1*DIGIT / "-"
908 filepath = ("data/"
909 1*( unreserved / pct-encoded / sub-delims ))
910 ending = CR / LF / CRLF
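The fetch.txt rules above can be read with a short parser such as the following sketch. The function name is hypothetical, the URL check is limited to requiring a scheme, and the filepath is returned without percent-decoding. A length of "-" is reported as None.

```python
import re
from urllib.parse import urlsplit


def parse_fetch_line(line):
    """Return (url, length or None, filepath) for one fetch.txt line."""
    parts = re.split(r"[ \t]+", line.rstrip("\r\n"), maxsplit=2)
    if len(parts) != 3:
        raise ValueError(f"expected three fields: {line!r}")
    url, length, filepath = parts
    if not urlsplit(url).scheme:
        raise ValueError(f"URL has no scheme: {url!r}")
    if length == "-":
        size = None
    elif re.fullmatch(r"[0-9]+", length):
        size = int(length)
    else:
        raise ValueError(f"invalid length field: {length!r}")
    if not filepath.startswith("data/"):
        raise ValueError(f"filepath is outside the payload: {filepath!r}")
    return url, size, filepath


print(parse_fetch_line("https://example.org/a.txt 1024 data/a.txt\n"))
```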
912 8. Contributors
914 Additional contributors to the authoring of BagIt are Andy Boyko,
915 David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate
916 Zwaard.
918 9. Acknowledgements
920 BagIt benefitted from the thoughtful assistance of Stephen Abrams,
921 Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad
922 Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark
923 Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam
924 Turoff, and Jim Tuttle.
926 10. IANA Considerations
928 This draft does not request any action from IANA.
930 11. References
932 11.1. Normative References
934 [ni-registry]
935 IANA, "Named Information Hash Algorithm Registry",
936 September 2016.
939 [RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
940 DOI 10.17487/RFC1321, April 1992.
943 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
944 Requirement Levels", BCP 14, RFC 2119,
945 DOI 10.17487/RFC2119, March 1997.
948 [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration
949 Procedures", BCP 19, RFC 2978, DOI 10.17487/RFC2978,
950 October 2000.
952 [RFC3174] Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
953 (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.
956 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
957 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
958 2003.
960 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
961 Resource Identifier (URI): Generic Syntax", STD 66,
962 RFC 3986, DOI 10.17487/RFC3986, January 2005.
965 [RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
966 (SHA and SHA-based HMAC and HKDF)", RFC 6234,
967 DOI 10.17487/RFC6234, May 2011.
970 [RFC6920] Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B.,
971 Keranen, A., and P. Hallam-Baker, "Naming Things with
972 Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013.
975 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
976 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
977 May 2017.
979 11.2. Informative References
981 [ENCDEP] Tabata, K., "A Collaboration Model between Archival
982 Systems to Enhance the Reliability of Preservation by an
983 Enclose-and-Deposit Method", 2005.
986 [LC-CONFORMANCE-SUITE]
987 The Library of Congress, "BagIt Conformance Suite", 2016-.
991 [MSFNAM] Microsoft, Inc., "Naming a File", 2008.
994 [RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
995 Specifications: ABNF", RFC 4234, DOI 10.17487/RFC4234,
996 October 2005.
998 [TN1150] Apple Inc., "Technical Note TN1150: HFS Plus Volume
999 Format", March 2004.
1003 [UNICODE-TR15]
1004 Unicode Consortium, "Unicode(R) Standard Annex #15:
1005 Unicode Normalization Forms", February 2016.
1008 Authors' Addresses
1010 John A. Kunze
1011 California Digital Library
1012 415 20th St, 4th Floor
1013 Oakland, CA 94612
1014 US
1016 Email: jak@ucop.edu
1018 Justin Littman
1019 George Washington University Libraries
1020 2130 H St NW
1021 Washington, DC 20052
1022 USA
1024 Email: justinlittman@gwu.edu
1026 Liz Madden
1027 Library of Congress
1028 101 Independence Avenue SE
1029 Washington, DC 20540
1030 USA
1032 Email: emad@loc.gov
1033 John Scancella
1034 Library of Congress
1035 101 Independence Avenue SE
1036 Washington, DC 20540
1037 USA
1039 Email: jsca@loc.gov
1041 Chris Adams
1042 Library of Congress
1043 101 Independence Avenue SE
1044 Washington, DC 20540
1045 USA
1047 Email: cadams@loc.gov