2 Network Working Group J. Kunze
3 Internet-Draft California Digital Library
4 Intended status: Informational J. Littman
5 Expires: March 21, 2019 Stanford Libraries
6 E. Madden
7 Library of Congress
8 J. Scancella
10 C. Adams
11 Library of Congress
12 September 17, 2018
14 The BagIt File Packaging Format (V1.0)
15 draft-kunze-bagit-17
17 Abstract
19 This document describes BagIt, a set of hierarchical file layout
20 conventions for storage and transfer of arbitrary digital content. A
21 "bag" has just enough structure to enclose descriptive metadata
22 "tags" and a file "payload" but does not require knowledge of the
23 payload's internal semantics. This BagIt format is suitable for
24 reliable storage and transfer.
26 Status of This Memo
28 This Internet-Draft is submitted in full conformance with the
29 provisions of BCP 78 and BCP 79.
31 Internet-Drafts are working documents of the Internet Engineering
32 Task Force (IETF). Note that other groups may also distribute
33 working documents as Internet-Drafts. The list of current Internet-
34 Drafts is at https://datatracker.ietf.org/drafts/current/.
36 Internet-Drafts are draft documents valid for a maximum of six months
37 and may be updated, replaced, or obsoleted by other documents at any
38 time. It is inappropriate to use Internet-Drafts as reference
39 material or to cite them other than as "work in progress."
41 This Internet-Draft will expire on March 21, 2019.
43 Copyright Notice
45 Copyright (c) 2018 IETF Trust and the persons identified as the
46 document authors. All rights reserved.
48 This document is subject to BCP 78 and the IETF Trust's Legal
49 Provisions Relating to IETF Documents
50 (https://trustee.ietf.org/license-info) in effect on the date of
51 publication of this document. Please review these documents
52 carefully, as they describe your rights and restrictions with respect
53 to this document. Code Components extracted from this document must
54 include Simplified BSD License text as described in Section 4.e of
55 the Trust Legal Provisions and are provided without warranty as
56 described in the Simplified BSD License.
58 Table of Contents
60 1. Introduction
61 1.1. Purpose
62 1.2. Requirements
63 1.3. Terminology
64 2. Structure
65 2.1. Required Elements
66 2.1.1. Bag Declaration: bagit.txt
67 2.1.2. Payload Directory: data/
68 2.1.3. Payload Manifest: manifest-algorithm.txt
69 2.2. Optional Elements
70 2.2.1. Tag Manifest: tagmanifest-algorithm.txt
71 2.2.2. Bag Metadata: bag-info.txt
72 2.2.3. Fetch File: fetch.txt
73 2.2.4. Other Tag Files
74 2.3. Text Tag File Format
75 2.4. Bag Checksum Algorithms
76 3. Complete and Valid bags
77 4. Examples
78 4.1. Example of a basic bag
79 4.2. Example bag using fetch.txt
80 5. Security Considerations
81 5.1. Special directory characters
82 5.2. Control of URLs in fetch.txt
83 5.3. File sizes in fetch.txt
84 5.4. Attacks on payload file content
85 6. Practical Considerations (non-normative)
86 6.1. Interoperability
87 6.1.1. Filename normalization
88 6.1.2. Windows and Unix file naming
89 6.1.3. Legacy checksum tools
90 7. Augmented Backus-Naur Form (non-normative)
91 7.1. Bag Declaration: bagit.txt
92 7.2. Payload Manifest: manifest-algorithm.txt
93 7.3. Bag Metadata: bag-info.txt
94 7.4. Fetch File: fetch.txt
95 8. Contributors
96 9. Acknowledgements
97 10. IANA Considerations
98 11. References
99 11.1. Normative References
100 11.2. Informative References
101 Authors' Addresses
103 1. Introduction
105 1.1. Purpose
107 BagIt is a set of hierarchical file layout conventions designed to
108 support storage and transfer of arbitrary digital content. A bag
109 consists of a directory containing the payload files and other
110 accompanying metadata files known as "tag" files. The "tags" are
111 metadata files intended to facilitate and document the storage and
112 transfer of the bag. Processing a bag does not require any
113 understanding of the payload file contents and the payload files can
114 be accessed without processing the BagIt metadata.
116 The name, BagIt, is inspired by the "enclose and deposit" method
117 [ENCDEP], sometimes referred to as "bag it and tag it". BagIt
118 differs from serialized archive formats such as MIME, TAR, or ZIP in
119 two general areas:
121 1. Strong integrity assurances. The format supports cryptographic-
122 quality hash algorithms (see Section 2.4) and allows for in-place
123 upgrades to add additional manifests using stronger algorithms
124 without breaking backwards compatibility. This provides high
125 levels of confidence against data corruption but is not designed
126 to be secure against active attacks.
128 2. Direct file access. Because BagIt specifies an actual filesystem
129 hierarchy rather than a serialized representation of one, files
130 can be accessed using standard operating system utilities,
131 implementations do not need to process a potentially large
132 archive file to extract a subset of data, and the format imposes
133 no size limits for either individual files or a bag.
135 BagIt is widely used for preserving digital assets originating from
136 different domains. Organizations involved in digital preservation
137 with BagIt include the Library of Congress, Dryad Data Repository,
138 NSF DataONE, and the Rockefeller Archive Center. Software
139 implementations are available for many languages including Python,
140 Ruby, Java, Perl, and PHP. It is also used in the libraries of many
141 universities, such as Cornell, Purdue, Stanford, Ghent University,
142 New York University, and the University of California.
144 1.2. Requirements
146 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
147 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
148 "OPTIONAL" in this document are to be interpreted as described in BCP
149 14 [RFC2119] [RFC8174] when, and only when, they appear in all
150 capitals as shown here.
152 Implementers are strongly encouraged to review the interoperability
153 considerations described in Section 6.1.
155 1.3. Terminology
157 The following terms have precise definitions as used in this
158 document:
160 bag: A set of opaque files contained within the structure defined by
161 this document.
163 bag declaration: The file required to be in all bags conforming to
164 this document. Contains values necessary to process the rest of a
165 bag. See Section 2.1.1.
167 bag checksum algorithm: The name of a cryptographic checksum
168 algorithm which has been normalized for use in a manifest or tag
169 manifest file name (e.g. "sha512") as described in Section 2.4.
171 manifest: A tag file that maps filepaths to checksums. A manifest
172 can be a payload manifest (Section 2.1.3) or a tag manifest
173 (Section 2.2.1).
175 payload: The data encapsulated by the bag as a set of named files,
176 which may be organized in sub-directories. The contents of the
177 payload files are opaque to this document, and, with respect to
178 BagIt processing, are always considered as sequences of
179 uninterpreted octets. See Section 2.1.2.
181 tag directory: A directory that contains one or more tag files.
183 tag file: A file which contains metadata about the bag or its
184 payload. This document defines the standard BagIt tag files: the
185 bag declaration in "bagit.txt" (Section 2.1.1), payload manifests
186 (Section 2.1.3), tag manifests (Section 2.2.1), bag metadata in
187 "bag-info.txt" (Section 2.2.2), and remote payload in "fetch.txt"
188 (Section 2.2.3). This document also allows other arbitrary tag
189 files as described in Section 2.2.4.
191 complete: A bag which contains every element required by this
192 document, every payload file listed in a manifest, and any
193 optional files which are listed in a tag manifest. See Section 3.
195 valid: A complete bag where every checksum in every manifest has been
196 successfully verified against the corresponding file.
198 2. Structure
200 A bag MUST consist of a base directory containing:
202 1. a set of required and optional tag files (Section 2.2)
204 2. a sub-directory named "data", called the payload directory
205 (Section 2.1.2)
207 3. a set of optional tag directories
209 The tag files in the base directory consist of one or more files
210 named "manifest-_algorithm_.txt" (see Section 2.1.3 and Section 2.4),
211 a file named "bagit.txt" (see Section 2.1.1), and zero or more
212 additional tag files (see Section 2.2). The tag files and
213 directories are in arbitrary file hierarchies and MAY have any name
214 that is not reserved for a file or directory in this document.
216 The base directory can have any name.
218 <base directory>/
219    |
220    +-- bagit.txt
221    |
222    +-- manifest-<algorithm>.txt
223    |
224    +-- [additional tag files]
225    |
226    +-- data/
227    |     |
228    |     +-- [payload files]
229    |
230    +-- [tag directories]/
231          |
232          +-- [tag files]
234 2.1. Required Elements
235 2.1.1. Bag Declaration: bagit.txt
237 The "bagit.txt" tag file MUST consist of exactly two lines in this
238 order:
240 BagIt-Version: M.N
241 Tag-File-Character-Encoding: ENCODING
243 _M.N_ identifies the BagIt major (M) and minor (N) version numbers.
244 _ENCODING_ identifies the character set encoding used by the
245 remaining tag files. _ENCODING_ SHOULD be "UTF-8" but for backwards
246 compatibility it MAY be any other encoding registered in
247 [cs-registry]. The bag declaration itself MUST be encoded in UTF-8,
248 and MUST NOT contain a byte-order mark (BOM) [RFC3629].
250 The number for this version of BagIt is "1.0".
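The two-line structure above lends itself to a small parser. The following is a minimal sketch in Python, not a reference implementation; the function name "parse_bag_declaration" and its error messages are illustrative assumptions and do not come from this document.

```python
def parse_bag_declaration(raw: bytes) -> tuple[int, int, str]:
    """Return (major, minor, encoding) from the raw bytes of bagit.txt.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    # The bag declaration itself must be UTF-8 without a byte-order mark.
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("bagit.txt must not start with a byte-order mark")
    lines = raw.decode("utf-8").splitlines()
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")

    version_label, _, version = lines[0].partition(": ")
    encoding_label, _, encoding = lines[1].partition(": ")
    if (version_label != "BagIt-Version"
            or encoding_label != "Tag-File-Character-Encoding"):
        raise ValueError("unexpected labels or ordering in bagit.txt")

    # M.N: major and minor version numbers separated by a period.
    major, _, minor = version.strip().partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError("BagIt-Version must have the form M.N")
    return int(major), int(minor), encoding.strip()
```

A caller would typically read the file in binary mode, pass the bytes to this function, and use the returned encoding name to decode the remaining tag files.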
252 2.1.2. Payload Directory: data/
254 The base directory MUST contain a sub-directory named "data".
256 The payload directory contains the arbitrary digital content within
257 the bag. The files under the payload directory are called payload
258 files, or the payload. Each payload file is treated as an opaque
259 octet stream when verifying file correctness. Payload files MAY be
260 organized in arbitrary sub-directory structures within the payload
261 directory; however, for the purpose of this document, such
262 sub-directory structures and filenames have no given meaning.
264 2.1.3. Payload Manifest: manifest-algorithm.txt
266 A payload manifest file provides a complete listing of each payload
267 file name along with a corresponding checksum to permit data
268 integrity checking. A bag can have more than one payload manifest,
269 with each using a different checksum algorithm. Manifest entries
270 MUST satisfy the following constraints:
272 o Every bag MUST contain at least one payload manifest file and MAY
273 contain more than one.
275 o Every payload manifest MUST list every payload file name exactly
276 once.
278 o A payload manifest file MUST have a name of the form "manifest-
279 _algorithm_.txt", where _algorithm_ is a string specifying the
280 checksum algorithm used by that manifest as described in
281 Section 2.4.
283 Example payload manifest filenames
285 manifest-sha256.txt
286 manifest-sha512.txt
288 Each line of a payload manifest file MUST be of the form:
290 checksum filepath
292 where _filepath_ is the pathname of a file relative to the base
293 directory, and _checksum_ is a hex-encoded checksum calculated
294 according to _algorithm_ over every octet in the file.
296 o The hex-encoded checksum MAY use uppercase and/or lowercase
297 letters.
299 o The slash character ('/') MUST be used as a path separator in
300 _filepath_.
302 o One or more linear whitespace characters (spaces or tabs) MUST
303 separate _checksum_ from _filepath_.
305 o There is no limitation on the length of a pathname.
307 o The payload manifest MUST NOT reference files outside the payload
308 directory.
310 o If a _filepath_ includes a line feed (LF), a carriage return (CR),
311 carriage return plus line feed (CRLF) or percent sign (%), those
312 characters (and only those) MUST be percent-encoded following
313 [RFC3986].
315 A manifest MUST NOT reference directories. Bag creators who wish to
316 create an otherwise empty directory have typically done so by
317 creating an empty placeholder file with a name such as ".keep".
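The line format and percent-encoding rule of this section can be illustrated with a short Python sketch. The helper name "parse_manifest_line" is hypothetical, and rejecting non-hex checksums with a ValueError is an assumption of the sketch rather than something this document mandates.

```python
import re


def parse_manifest_line(line: str) -> tuple[str, str]:
    """Split one payload manifest line into (checksum, filepath).

    Hypothetical helper; the checksum is lowercased for later comparison.
    """
    # One or more spaces or tabs separate the checksum from the filepath.
    # Only the first run is a separator: the filepath may contain spaces.
    checksum, filepath = re.split(r"[ \t]+", line.rstrip("\r\n"), maxsplit=1)
    if not re.fullmatch(r"[0-9A-Fa-f]+", checksum):
        raise ValueError(f"checksum is not hex-encoded: {checksum!r}")

    # Decode only %0A (LF), %0D (CR) and %25 (%); other sequences are literal.
    # A single pass ensures "%250A" decodes to "%0A" and not to a line feed.
    filepath = re.sub(
        r"%(0[AaDd]|25)",
        lambda match: chr(int(match.group(1), 16)),
        filepath,
    )
    if not filepath.startswith("data/"):
        raise ValueError("payload manifest entry is outside the payload directory")
    return checksum.lower(), filepath
```

The "data/" prefix check is only a first filter; resolving the path safely against the bag directory is a separate concern (see Section 5.1).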
319 2.2. Optional Elements
321 2.2.1. Tag Manifest: tagmanifest-algorithm.txt
323 A tag manifest is a tag file that lists other tag files and checksums
324 for those tag files generated using a particular bag checksum
325 algorithm.
327 A bag MAY contain one or more tag manifests, in which case each tag
328 manifest SHOULD list the same set of tag files.
330 Each tag manifest MUST list every payload manifest. Each tag
331 manifest MUST NOT list any tag manifests, but SHOULD list the
332 remaining tag files present in the bag.
334 A tag manifest file MUST have a name of the form "tagmanifest-
335 _algorithm_.txt", where _algorithm_ is a string following the format
336 described in Section 2.4 specifying the bag checksum algorithm used
337 in that manifest.
339 Tag manifests SHOULD use the same algorithms as the payload manifests
340 that are present in the bag.
342 Example tag manifest filenames:
344 tagmanifest-sha256.txt
345 tagmanifest-sha512.txt
347 A tag manifest file has the same form as the payload manifest file
348 described in Section 2.1.3, but MUST NOT list any payload files.
349 As a result, no _filepath_ listed in a tag manifest begins "data/".
351 2.2.2. Bag Metadata: bag-info.txt
353 The "bag-info.txt" file is a tag file that contains metadata elements
354 describing the bag and the payload. The metadata elements contained
355 in the "bag-info.txt" file are intended primarily for human use. All
356 metadata elements are OPTIONAL and MAY be repeated. Because "bag-
357 info.txt" is intended for human reading and editing, ordering MAY be
358 significant and the ordering of metadata elements MUST be preserved.
360 A metadata element MUST consist of a label, a colon ":", a single
361 linear whitespace character (space or tab), and a value, terminated
362 with a line feed (LF), a carriage return (CR), or carriage return plus
363 line feed (CRLF).
365 The label MUST NOT contain colon (:), line feeds (LF) or carriage
366 returns (CR). The label MAY contain linear whitespace characters,
367 but MUST NOT start or end with whitespace.
369 It is RECOMMENDED that lines not exceed 79 characters in length.
370 Long values MAY be continued onto the next line by inserting a line
371 feed (LF), a carriage return (CR), or carriage return plus line feed
372 (CRLF) and indenting the next line with one or more linear whitespace
373 characters (spaces or tabs). Except for line breaks, such padding does
374 not form part of the value.
376 Implementations wishing to support previous BagIt versions MUST
377 accept multiple linear whitespace before and after the colon when the
378 bag version is earlier than 1.0; such whitespace does not form part
379 of the label or value.
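The label, value, and continuation rules above can be sketched as a small Python parser. The function name "parse_bag_info" is hypothetical. Two choices in the sketch are assumptions and are not settled by this document: continuation lines are joined to the value with a single space, and blank lines are skipped.

```python
def parse_bag_info(text: str) -> list[tuple[str, str]]:
    """Parse bag-info.txt text into an ordered list of (label, value) pairs.

    Hypothetical helper. A list is used instead of a dict because elements
    may repeat and their order has to be preserved.
    """
    elements: list[tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        if line[0] in " \t":
            # Indented line: continuation of the previous element's value.
            if not elements:
                raise ValueError("continuation line before any element")
            label, value = elements[-1]
            # Assumption: the fold is replaced by one space.
            elements[-1] = (label, value + " " + line.strip())
            continue
        label, colon, value = line.partition(":")
        if not colon or label != label.strip():
            raise ValueError(f"malformed metadata element: {line!r}")
        # Stripping the value also tolerates the extra whitespace that
        # pre-1.0 bags may carry after the colon.
        elements.append((label, value.strip()))
    return elements
```

Splitting on the first colon only means that values such as URLs, which contain further colons, are kept intact.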
381 The following are reserved metadata elements. The use of these
382 reserved metadata elements is OPTIONAL but encouraged. Reserved
383 metadata element names are case-insensitive. Except where indicated
384 otherwise, these metadata element names MAY be repeated to capture
385 multiple values.
387 Source-Organization: Organization transferring the content.
389 Organization-Address: Mailing address of the source organization.
391 Contact-Name: Person at the source organization who is responsible
392 for the content transfer.
394 Contact-Phone: International format telephone number of person or
395 position responsible.
397 Contact-Email: Fully qualified email address of person or position
398 responsible.
400 External-Description: A brief explanation of the contents and
401 provenance.
403 Bagging-Date: Date (YYYY-MM-DD) that the content was prepared for
404 transfer. This metadata element SHOULD NOT be repeated.
406 External-Identifier: A sender-supplied identifier for the bag.
408 Bag-Size: Size or approximate size of the bag being transferred,
409 followed by an abbreviation such as MB (megabytes), GB, or TB; for
410 example, 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum
411 (described next), Bag-Size is intended for human consumption.
412 This metadata element SHOULD NOT be repeated.
414 Payload-Oxum: The "octetstream sum" of the payload, intended for the
415 purpose of quickly detecting incomplete bags before performing
416 checksum validation. This is strictly an optimization and
417 implementations MUST perform the standard checksum validation
418 process before proclaiming a bag to be valid. This element MUST
419 NOT be present more than once and, if present, MUST be in the form
420 "_OctetCount_._StreamCount_", where _OctetCount_ is the total
421 number of octets (8-bit bytes) across all payload file content and
422 _StreamCount_ is the total number of payload files. This metadata
423 element MUST NOT be repeated.
425 Bag-Group-Identifier: A sender-supplied identifier for the set, if
426 any, of bags to which it logically belongs. This identifier
427 SHOULD be unique across the sender's content, and if recognizable
428 as belonging to a globally unique scheme, the receiver SHOULD make
429 an effort to honor reference to it. This metadata element SHOULD
430 NOT be repeated.
432 Bag-Count: Two numbers separated by "of", in particular, "N of T",
433 where T is the total number of bags in a group of bags and N is
434 the ordinal number within the group; if T is not known, specify it
435 as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
436 145. This metadata element SHOULD NOT be repeated. If this
437 metadata element is present, it is RECOMMENDED to also include the
438 Bag-Group-Identifier element.
440 Internal-Sender-Identifier: An alternate sender-specific identifier
441 for the content and/or bag.
443 Internal-Sender-Description: A sender-local explanation of the
444 contents and provenance.
446 In addition to these metadata elements, other arbitrary metadata
447 elements MAY also be present.
449 An example "bag-info.txt" file
451 Source-Organization: FOO University
452 Organization-Address: 1 Main St., Cupertino, California, 11111
453 Contact-Name: Jane Doe
454 Contact-Phone: +1 111-111-1111
455 Contact-Email: example@example.com
456 External-Description: Uncompressed greyscale TIFF images from the
457 FOO papers colle...
458 Bagging-Date: 2008-01-15
459 External-Identifier: university_foo_001
460 Payload-Oxum: 279164409832.1198
461 Bag-Group-Identifier: university_foo
462 Bag-Count: 1 of 15
463 Internal-Sender-Identifier: /storage/images/foo
464 Internal-Sender-Description: Uncompressed greyscale TIFFs created
465 from microfilm and are...
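A Payload-Oxum value such as the one in the example above could be computed by walking the payload directory. The following Python sketch assumes a bag stored on a local filesystem; the function name "payload_oxum" is illustrative only.

```python
import os


def payload_oxum(bag_dir: str) -> str:
    """Return "OctetCount.StreamCount" for the payload under bag_dir/data.

    Hypothetical helper; it counts regular files found by os.walk.
    """
    octet_count = 0   # total size of all payload files, in octets
    stream_count = 0  # number of payload files
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```

Comparing this value with the Payload-Oxum element is a quick completeness check only; it does not replace checksum validation.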
467 2.2.3. Fetch File: fetch.txt
469 For reasons of efficiency, a bag MAY be sent with a list of files to
470 be fetched and added to the payload before it can meaningfully be
471 checked for completeness. The fetch file allows a bag to be
472 transmitted with "holes" in it, which can be practical for several
473 reasons. For example, it obviates the need for the sender to stage a
474 large serialized copy of the content while the bag is transferred to
475 the receiver. Also, this method allows a sender to construct a bag
476 from components that are either a subset of logically related
477 components (e.g., the localized logical object could be much larger
478 than what is intended for export) or assembled from logically
479 distributed sources (e.g., the object components for export are not
480 stored locally under one filesystem tree). An OPTIONAL tag file
481 called the fetch file contains such a list.
483 The fetch file MUST be named "fetch.txt". Every file listed in the
484 fetch file MUST be listed in every payload manifest. A fetch file
485 MUST NOT list any tag files.
487 Each line of a fetch file MUST be of the form:
489 url length filepath
491 where _url_ identifies the file to be fetched and MUST be an absolute
492 URI as defined in [RFC3986], _length_ is the number of octets in the
493 file (or "-", to leave it unspecified), and _filepath_ identifies the
494 corresponding payload file, relative to the base directory.
496 The slash character ('/') MUST be used as a path separator in
497 _filepath_. One or more linear whitespace characters (spaces or tabs)
498 MUST separate these three values, and any such characters in the
499 _url_ MUST be percent-encoded [RFC3986]. If _filepath_ includes a
500 line feed (LF), a carriage return (CR), carriage return plus line
501 feed (CRLF) or percent sign (%), those characters (and only those)
502 MUST be percent-encoded following [RFC3986]. There is no limitation
503 on the length of any of the fields in the fetch file.
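The three-field line format above can be parsed in much the same way as a manifest line. The Python sketch below is illustrative; "parse_fetch_line" is a hypothetical name, and treating any URL with a scheme as "absolute" is a simplifying assumption.

```python
import re
from urllib.parse import urlsplit


def parse_fetch_line(line: str):
    """Split one fetch.txt line into (url, length or None, filepath)."""
    # Spaces or tabs separate the three values; the filepath comes last
    # and may itself contain spaces.
    url, length, filepath = re.split(
        r"[ \t]+", line.rstrip("\r\n"), maxsplit=2)
    if not urlsplit(url).scheme:
        raise ValueError(f"url is not an absolute URI: {url!r}")

    # "-" leaves the length unspecified.
    if length == "-":
        size = None
    elif length.isdigit():
        size = int(length)
    else:
        raise ValueError(f"length is neither '-' nor an octet count: {length!r}")

    # Decode only %0A (LF), %0D (CR) and %25 (%) in the filepath.
    filepath = re.sub(
        r"%(0[AaDd]|25)",
        lambda match: chr(int(match.group(1), 16)),
        filepath,
    )
    return url, size, filepath
```

The returned length is advisory only; Section 5.3 discusses why it should not drive resource allocation.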
505 2.2.4. Other Tag Files
507 A bag MAY contain other tag files that are not defined by this
508 document. Implementations MUST perform standard checksum validation
509 on any tag file which is listed in a tag manifest but MUST otherwise
510 ignore their contents.
512 2.3. Text Tag File Format
514 All tag files specifically described in this document MUST adhere to
515 the text tag file format described below. Other tag files MAY adhere
516 to the text tag file format described below.
518 Text tag files are line-oriented, and each line MUST be terminated by
519 a line feed (LF), a carriage return (CR), or carriage return plus
520 line feed (CRLF). It is RECOMMENDED that the last line in a tag file
521 also end with LF, CR, or CRLF. Text tag file names MUST end in the
522 extension ".txt".
524 In all text tag files except for the bag declaration file, text MUST
525 use the character encoding specified in the "bagit.txt" bag
526 declaration file. Text tag files except for the bag declaration file
527 MAY include a byte-order mark (BOM) only if the specified encoding
528 requires it for proper decoding. In accordance with [RFC3629], when
529 "bagit.txt" specifies UTF-8 the tag files MUST NOT begin with a
530 byte-order mark (BOM). See Section 2.1.1.
532 The use of UTF-8 for text tag files is strongly RECOMMENDED. A
533 future version of BagIt may disallow encodings other than UTF-8.
535 2.4. Bag Checksum Algorithms
537 The payload manifest and tag manifests permit validating the
538 integrity of the payload and tag files in a bag, using checksums
539 produced by checksum algorithms. Checksum values MUST be encoded so as to
540 conform to the manifest format specified in Section 2.1.3. However,
541 the internal details of a checksum are outside the scope of this
542 document.
544 To avoid future ambiguity, the checksum algorithm SHOULD be
545 registered in IANA's "Named Information Hash Algorithm Registry"
546 [ni-registry] according to [RFC6920], but MAY for backwards
547 compatibility also be MD5 [RFC1321] or SHA-1 [RFC3174].
549 The name of the checksum algorithm MUST be normalized for use in the
550 manifest's filename by lowercasing the common name of the algorithm
551 and removing all non-alphanumeric characters. Following is a partial
552 list mapping common algorithm names to normalized names:
554 o MD5: md5
556 o SHA-1: sha1
558 o SHA-256: sha256
560 o SHA-512: sha512
562 Starting with BagIt 1.0, bag creation and validation tools MUST
563 support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
564 enable SHA-512 by default when creating new bags. For backwards
565 compatibility implementers SHOULD support MD5 [RFC1321] and SHA-1
566 [RFC3174]. Implementers are encouraged to simplify the process of
567 adding additional manifests using new algorithms to streamline the
568 process of in-place upgrades.
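The normalization rule in this section (lowercase the common name, then remove every non-alphanumeric character) can be written as a one-line function. The Python sketch below uses the hypothetical names "normalize_algorithm_name" and "manifest_filename", and assumes that "alphanumeric" means ASCII letters and digits.

```python
import re


def normalize_algorithm_name(common_name: str) -> str:
    """Lowercase the name and drop every character outside a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def manifest_filename(common_name: str, tag: bool = False) -> str:
    """Build the payload or tag manifest file name for an algorithm."""
    prefix = "tagmanifest" if tag else "manifest"
    return f"{prefix}-{normalize_algorithm_name(common_name)}.txt"
```

A normalized name is not guaranteed to match the name a particular hashing library expects. For example, Python's hashlib names the SHA-3 variant "sha3_256", so an implementation would likely need its own mapping from normalized names to library names.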
570 3. Complete and Valid bags
572 A _complete_ bag MUST meet the following requirements:
574 1. Every required element MUST be present (Section 2.1).
576 2. Every file listed in every tag manifest MUST be present.
578 3. Every file listed in every payload manifest MUST be present.
580 4. For BagIt 1.0, every payload file MUST be listed in every payload
581 manifest. Note that older versions of BagIt allowed payload
582 files to be listed in just one of the manifests.
584 5. Every element present MUST conform to BagIt 1.0.
586 A _valid_ bag MUST meet the following requirements:
588 1. The bag MUST be _complete_.
590 2. Every checksum in every payload manifest and tag manifest has
591 been successfully verified against the contents of the
592 corresponding file.
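The payload part of these requirements can be sketched as follows. The Python example assumes the manifest has already been parsed into a mapping from filepath to checksum, and that the algorithm name is one accepted by hashlib; "file_digest" and "check_payload" are hypothetical names. Tag manifests, fetch.txt, and the bag declaration are deliberately left out.

```python
import hashlib
import os


def file_digest(path: str, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large payload files are never fully loaded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_payload(bag_dir: str, algorithm: str, entries: dict) -> list:
    """Return a list of problems; an empty list means the payload checks out.

    entries maps each manifest filepath (e.g. "data/a.txt") to its checksum.
    """
    on_disk = set()
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            relative = os.path.relpath(os.path.join(root, name), bag_dir)
            on_disk.add(relative.replace(os.sep, "/"))

    listed = set(entries)
    problems = []
    # Completeness: every listed file is present, every file is listed.
    problems += [f"missing: {p}" for p in sorted(listed - on_disk)]
    problems += [f"unlisted: {p}" for p in sorted(on_disk - listed)]
    # Validity: checksums match (hex case is not significant).
    for filepath in sorted(listed & on_disk):
        local_path = os.path.join(bag_dir, *filepath.split("/"))
        if file_digest(local_path, algorithm) != entries[filepath].lower():
            problems.append(f"checksum mismatch: {filepath}")
    return problems
```

A full validator would run this once per payload manifest and apply the same checksum comparison to the files listed in each tag manifest.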
594 4. Examples
596 4.1. Example of a basic bag
598 This is the layout of a basic bag containing an image and a companion
599 OCR file. Lines of file content are shown with added parentheses to
600 indicate each complete line. For brevity this example uses MD5
601 rather than the recommended SHA-512.
603 myfirstbag/
604 |
605 | manifest-md5.txt
606 | (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
607 | (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
608 |
609 | bagit.txt
610 | (BagIt-Version: 1.0 )
611 | (Tag-File-Character-Encoding: UTF-8 )
612 |
613 \--- data/
614 |
615 | 27613-h/images/q172.png
616 | (... image bytes ... )
617 |
618 | 27613-h/images/q172.txt
619 | (... OCR text ... )
620 ....
622 4.2. Example bag using fetch.txt
624 This is the layout of a bag which expects the receiver to download
625 the files listed in the payload manifests prior to validation. Lines
626 of file content are shown with added parentheses to indicate each
627 complete line. For brevity this example uses MD5 rather than the
628 recommended SHA-512.
630 highsmith-tahoe/
631 |
632 | manifest-md5.txt
633 | (102b0e6effe208ef9b29864946de9e22 data/23364a.tif )
634 |
635 | fetch.txt
636 | (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
637 | 216951362 data/23364a.tif )
638 |
639 | bagit.txt
640 | (BagIt-Version: 1.0 )
641 | (Tag-File-Character-Encoding: UTF-8 )
642 |
643 | bag-info.txt
644 | (Internal-Sender-Description: Download link found at )
645 | ( https://www.loc.gov/resource/highsm.23364/ )
647 5. Security Considerations
649 5.1. Special directory characters
651 The paths specified in the payload manifests, tag manifests, and
652 fetch files do not prohibit special directory characters which have
653 special meaning on some operating systems. Implementers MUST ensure
654 that files outside the bag directory structure are not accessed when
655 reading or writing files based on paths specified in a bag.
657 All implementations SHOULD have a test suite to guard against special
658 directory characters.
660 For example, a maliciously crafted "tagmanifest-sha512.txt" file
661 might contain entries which begin with a path character such as "/",
662 "..", or a "~username" home directory reference in an attempt to
663 cause a naive implementation to leak or overwrite targeted files on a
664 POSIX operating system.
666 Windows implementations SHOULD test their implementations to ensure
667 that safety-checks prevent use of drive letters and the less commonly
668 used namespace sequences (e.g. "\\?\C:\...") described in [MSFNAM].
670 To assist implementers, the Library of Congress conformance suite
671 [LC-CONFORMANCE-SUITE] has some tests for invalid bags which are
672 expected to fail on POSIX or Windows clients.
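One possible guard against the paths described in this section is to resolve every filepath against the bag directory and reject anything that lands outside it. The Python sketch below is an illustration under assumptions, not a complete defence; "resolve_in_bag" is a hypothetical name, and the sketch does not address races between the check and later file access.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def resolve_in_bag(bag_dir: str, filepath: str) -> str:
    """Map a manifest or fetch filepath to a real path inside bag_dir.

    Raises ValueError when the path is anchored or escapes the bag.
    """
    # Reject POSIX absolute paths, drive letters, and UNC or \\?\ prefixes,
    # whatever platform the code happens to run on.
    if PurePosixPath(filepath).is_absolute() or PureWindowsPath(filepath).anchor:
        raise ValueError(f"absolute or anchored path: {filepath!r}")

    base = os.path.realpath(bag_dir)
    # realpath collapses ".." segments and follows symlinks before the check.
    # "~username" is never expanded here, so it stays an ordinary file name.
    candidate = os.path.realpath(os.path.join(base, *filepath.split("/")))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes the bag directory: {filepath!r}")
    return candidate
```

All reads and writes driven by manifest or fetch entries would then go through the returned path rather than the raw filepath string.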
674 5.2. Control of URLs in fetch.txt
676 Implementers of tools that complete bags by retrieving URLs listed in
677 a fetch file need to be aware that some of those URLs might point to
678 hosts, intentionally or unintentionally, that are not under control
679 of the bag's sender. Moreover, older checksum algorithms, even if
680 reasonable for detecting corruption during transit, may not offer
681 strong cryptographic protection against intentional spoofing.
683 5.3. File sizes in fetch.txt
685 The size of files, as optionally reported in the fetch file, cannot
686 be guaranteed to match the actual file size to be downloaded.
687 Implementers SHOULD take steps to monitor and abort transfer when the
688 received file size exceeds the file size reported in the fetch file.
689 Implementers SHOULD NOT use the file size in the fetch file for
690 critical resource allocation, such as buffer sizing or storage
691 requisitioning.
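The advice above can be followed by copying a download in fixed-size chunks and aborting once the declared length is exceeded. The Python sketch below works on any pair of binary file-like objects; "copy_with_limit" and "SizeExceeded" are hypothetical names, and the network layer that produces the source stream is out of scope.

```python
class SizeExceeded(Exception):
    """Raised when more octets arrive than fetch.txt declared."""


def copy_with_limit(src, dst, declared_length, chunk_size=64 * 1024):
    """Copy src to dst in chunks; return the number of octets written.

    declared_length is the length from fetch.txt, or None when it was "-".
    The buffer size is fixed and never derived from declared_length.
    """
    received = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return received
        received += len(chunk)
        if declared_length is not None and received > declared_length:
            # Abort before writing the chunk that crosses the limit.
            raise SizeExceeded(
                f"received more than the declared {declared_length} octets")
        dst.write(chunk)
```

A shorter-than-declared transfer is not treated as an error here; it would surface later as a checksum mismatch during validation.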
693 5.4. Attacks on payload file content
695 The integrity assurance provided by manifests is designed to provide
696 high levels of confidence against data corruption but is not designed
697 to be secure against active attacks. Organizations that need to
698 secure bags against such threats SHOULD agree on additional measures,
699 such as digital signatures, that are out of scope for this
700 specification.
702 6. Practical Considerations (non-normative)
704 6.1. Interoperability
706 This section lists practical considerations for implementers and
707 users. None of the points below are required but they are
708 recommended for general-purpose usage.
710 Upon discovering errors in bags, an implementation is free to take
711 action (for example, logging or reporting) in an application-specific
712 manner. This document does not mandate any particular action.
714 The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is
715 provided as a public resource to test new implementations for
716 compatibility and error handling.
718 6.1.1. Filename normalization
720 This section provides background information on various challenges
721 caused by differences in how operating systems, filesystems, and
722 common tools handle filenames, followed by a list of recommendations
723 for implementers in Section 6.1.1.3.
725 6.1.1.1. Case sensitivity
727 There are three challenges for interoperability related to filename
728 case:
730 o Filesystems such as FAT or exFAT always convert filenames to
731 uppercase: "example.txt" will be stored as "EXAMPLE.TXT"
733 o Many Unix filesystems save filenames exactly as provided, allowing
734 multiple files which differ only in case: "example.txt" and
735 "Example.txt" are separate files
737 o NTFS and Apple's HFS Plus usually preserve case when storing files
738 but are case-insensitive when retrieving them. A file saved as
739 "Example.txt" will be retrieved by that name but will also be
740 retrieved as "EXAMPLE.TXT", "example.txt", etc.
742 6.1.1.2. Unicode normalization
744 The Unicode specification has common cases where different character
745 sequences produce the same human-meaningful text. These are referred
746 to as "canonically equivalent" and the Unicode specification defines
747 different normalization forms -- see [UNICODE-TR15] for the full
748 details and a brief example below:
750 The common surname "Nunez" normalized in different forms
752 Normalization Form D (Decomposition):
754 Char UTF8 Hex Name
755 ----------------------------------------------
756 N 4e LATIN CAPITAL LETTER N
757 u 75 LATIN SMALL LETTER U
758 \u0301 cc81 COMBINING ACUTE ACCENT
759 n 6e LATIN SMALL LETTER N
760 \u0303 cc83 COMBINING TILDE
761 e 65 LATIN SMALL LETTER E
762 z 7a LATIN SMALL LETTER Z
764 Normalization Form C (Canonical Composition):
766 Char UTF8 Hex Name
767 ----------------------------------------------
768 N 4e LATIN CAPITAL LETTER N
769 u c3ba LATIN SMALL LETTER U WITH ACUTE
770 n c3b1 LATIN SMALL LETTER N WITH TILDE
771 e 65 LATIN SMALL LETTER E
772 z 7a LATIN SMALL LETTER Z
774 Unicode normalization is relevant to BagIt implementers because
775 different systems have different standards for normalization:
777 o Apple's HFS Plus filesystem always normalizes filenames to a
778 fully-decomposed form based on the Unicode 2.0 specification (see
779 [TN1150]).
781 o Windows treats filenames as opaque character sequences (see
782 [MSFNAM]) and will store and return the encoded bytes exactly as
783 provided.
785 o Linux and other common Unix systems are generally similar to
786 Windows in storing and returning opaque byte streams but this
787 behaviour is technically filesystem-dependent.
789 o Utilities used for file management, transfer, and archival may
790 ignore this issue, apply an arbitrary normalization form, or allow
791 the user to control how normalization is applied.
793 In practice, this means that the encoded filename stored in a
794 manifest may fail a simple file existence check because the
795 filename's normalization was changed at some point after the manifest
796 was written. This situation is very confusing for users because the
797 filenames are visually indistinguishable and the "missing" file is
798 obviously present in the payload directory.
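The byte sequences in the tables above can be reproduced with Python's standard unicodedata module. This is only an illustration of the two normalization forms; the helper name utf8_hex is introduced here for the example.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as a hexadecimal string."""
    return text.encode("utf-8").hex()


# The surname written with precomposed u-acute and n-tilde (already NFC).
composed = "N\u00fa\u00f1ez"
decomposed = unicodedata.normalize("NFD", composed)

print(utf8_hex(decomposed))    # 4e75cc816ecc83657a
print(utf8_hex(composed))      # 4ec3bac3b1657a
print(composed == decomposed)  # False: a naive comparison sees two names
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The last two lines show why a simple string or file existence comparison fails while a comparison after normalization succeeds.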
800 6.1.1.3. Recommendations
802 o Implementations SHOULD discourage the creation of bags containing
803 files which differ only in case.
805 o Implementations SHOULD prevent the creation of bags containing
806 files which differ only in normalization form.
808 o BagIt implementations SHOULD tolerate differences in normalization
809 form by comparing the list of filesystem names with the list of
810 manifest names after applying the same normalization form to both.
812 o Implementations SHOULD issue a warning when multiple manifests are
813 present which differ only in case or normalization form.
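The recommendations above could be sketched as follows. The function names are hypothetical, NFC is chosen arbitrarily as the common form, and the collision key (case folding followed by NFC) is a simplification that may not cover every Unicode edge case.

```python
import unicodedata
from collections import defaultdict


def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)


def missing_from_filesystem(manifest_names, filesystem_names):
    """Manifest entries with no filesystem match after normalizing both sides."""
    on_disk = {nfc(name) for name in filesystem_names}
    return [name for name in manifest_names if nfc(name) not in on_disk]


def collisions(names):
    """Groups of names that differ only in case or normalization form."""
    groups = defaultdict(list)
    for name in names:
        # Simplified key (an assumption of this sketch): fold case, then NFC.
        groups[nfc(name.casefold())].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A bag creation tool might refuse to proceed, or warn, when collisions() returns a non-empty list, while a validator could use missing_from_filesystem() in place of a byte-for-byte name comparison.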
815 6.1.2. Windows and Unix file naming
817 As specified above, only the Unix-based path separator ('/') may be
818 used inside filenames listed in BagIt manifest and fetch.txt files.
819 When bags are exchanged between Windows and Unix platforms, the path
820 separator SHOULD be translated as needed. Receivers of bags on
821 physical media SHOULD be prepared for filesystems created under
822 either Windows or Unix. Besides the fundamental difference between
823 path separators ('\' and '/'), generally, Windows filesystems have
824 more limitations than Unix filesystems.
826 Windows path names have a maximum of 255 characters, and none of
827 these characters may be used in a path component:
829 < > : " / | ? *
831 Windows also reserves the following names, with or without a file
832 extension:
834 CON, PRN, AUX, NUL
835 COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
836 LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
838 See [MSFNAM] for more information and possible alternatives.
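A bag creator that wants its output to remain usable on Windows could screen paths against the characters and names listed above. The following is a minimal sketch; the function name is hypothetical, and it does not cover other Windows rules (such as length limits or trailing spaces and dots).

```python
# Characters and device names taken from the lists in this section.
WINDOWS_RESERVED_CHARS = set('<>:"/|?*')
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_unsafe_components(bag_path: str) -> list:
    """Return the components of a '/'-separated path that Windows rejects."""
    problems = []
    for component in bag_path.split("/"):
        # Reserved names apply with or without an extension, in any case.
        stem = component.split(".", 1)[0].upper()
        has_bad_char = any(ch in WINDOWS_RESERVED_CHARS for ch in component)
        if stem in WINDOWS_RESERVED_NAMES or has_bad_char:
            problems.append(component)
    return problems
```

Since the path is split on '/', that character never appears inside a component; it is kept in the set only to mirror the list above.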
840 6.1.3. Legacy checksum tools
842 Some bags have been manually assembled using checksum utilities such
843 as those contained in the GNU Coreutils package (md5sum, sha1sum,
844 etc.), collectively referred to here as "md5sum". Implementers who
845 desire wide support of legacy content should be aware of some known
846 quirks of these tools:
848 md5sum can be run in "text mode" which causes it to normalize line-
849 endings on some operating systems. On Unix-like systems both modes
850 will usually produce the same results but on systems like Windows
851 they can produce different results based on the file contents. The
852 md5sum output format has two characters between the checksum and the
853 filepath: the first is always a space and the second is an asterisk
854 ("*") for binary mode and a space for text mode.
856 A final note about md5sum-generated manifests is that for a
857 _filepath_ containing a backslash ('\'), the manifest line will have
858 a backslash inserted in front of the _checksum_ and, under Windows,
859 the backslashes inside _filepath_ can be doubled.
861 Implementers MAY wish to accept this format by ignoring a leading
862 asterisk or handling differences in line termination gracefully but,
863 if so, implementations MUST warn the user that the bag in question
864 will fail strict validation. In such cases it is RECOMMENDED that
865 tools provide an easy option to update the bag with valid manifests.
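A lenient reader for md5sum-style lines, following the quirks described above, might be sketched like this. The function name and warning strings are hypothetical, and the unescaping step only undoes doubled backslashes, which is an assumption based on the description in this section rather than a full treatment of the md5sum escape format.

```python
import re

# Optional leading backslash, hex checksum, a space, then a space (text mode)
# or an asterisk (binary mode), then the filepath.
_MD5SUM_LINE = re.compile(r"^(\\?)([0-9A-Fa-f]+) ([ *])(.+)$")


def parse_md5sum_line(line: str):
    """Return (checksum, filepath, warnings) for one md5sum-style line."""
    match = _MD5SUM_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not an md5sum-style line: {line!r}")
    escaped, checksum, mode, filepath = match.groups()
    warnings = []
    if mode == "*":
        warnings.append("binary-mode asterisk: bag will fail strict validation")
    if escaped:
        # Assumption of this sketch: only undo the doubling of backslashes.
        filepath = filepath.replace("\\\\", "\\")
        warnings.append("escaped line: bag will fail strict validation")
    return checksum.lower(), filepath, warnings
```

Returning the warnings alongside the parsed values lets the caller decide how to surface them, for example by offering to rewrite the manifest in the strict format.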
867 7. Augmented Backus-Naur Form (non-normative)
869 The Augmented Backus-Naur Form (ABNF) rules provided below are non-
870 normative. If there is a discrepancy between requirements in the
871 normative sections and the ABNF, the requirements in the normative
872 sections prevail. Some definitions use the core rules (e.g. DIGIT,
873 HEXDIG, etc.) as defined in [RFC4234].
875 7.1. Bag Declaration: bagit.txt
877 bagit.txt ABNF rules:
879 bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending
880 "Tag-File-Character-Encoding: " encoding ending
881 encoding = 1*CHAR
882 ending = CR / LF / CRLF
884 7.2. Payload Manifest: manifest-algorithm.txt
886 Payload Manifest ABNF rules:
888 payload-manifest = 1*payload-manifest-line
889 payload-manifest-line = checksum 1*WSP filepath ending
890 checksum = 1*case-hexdig
891 case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" /
892 "D" / "d" / "E" / "e" / "F" / "f"
893 filepath = "data/"
894 1*( unreserved / pct-encoded / sub-delims )
895 unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
896 sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
897 "*" / "+" / "," / ";" / "=" / "/"
898 pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25"
899 ending = CR / LF / CRLF
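The pct-encoded rule above admits only three escapes: carriage return, line feed, and the percent sign itself. A minimal sketch of encoding and decoding a filepath under that rule follows; the function names are hypothetical.

```python
import re

_DECODE = {"0D": "\r", "0A": "\n", "25": "%"}


def encode_filepath(path: str) -> str:
    """Percent-encode CR, LF, and '%'; the percent sign must be handled first."""
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def decode_filepath(encoded: str) -> str:
    """Decode in a single pass so that '%250A' becomes '%0A', not a line feed."""
    return re.sub(
        r"%(0[DdAa]|25)",
        lambda m: _DECODE[m.group(1).upper()],
        encoded,
    )
```

Decoding in one pass avoids the double-decoding mistake that sequential string replacement can introduce when the replacements are applied in the wrong order.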
901 7.3. Bag Metadata: bag-info.txt
903 bag-info.txt ABNF rules:
905 metadata = 1*metadata-line
906 metadata-line = key ":" WSP value ending *(continuation ending)
907 key = 1*non-reserved
908 value = 1*non-reserved
909 continuation = WSP 1*non-reserved
910 non-reserved = VCHAR / WSP
911 ; any valid character for the specific encoding
912 ; except those that match "ending"
913 ending = CR / LF / CRLF
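The metadata-line and continuation rules above can be read as: a line starting with whitespace extends the value of the preceding key. The sketch below illustrates that reading; the function name is hypothetical, joining continuations with a single space is a choice made for the example, and keys are split at the first colon.

```python
import re


def parse_bag_info(text: str):
    """Return a list of (key, value) pairs; repeated keys are preserved."""
    entries = []
    # Accept CR, LF, or CRLF as the line ending, as in the "ending" rule.
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue
        if line[0] in " \t":
            if not entries:
                raise ValueError("continuation line before any key")
            key, value = entries[-1]
            entries[-1] = (key, value + " " + line.strip())
        else:
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"missing ':' in line: {line!r}")
            entries.append((key, value.strip()))
    return entries
```

A list of pairs is used rather than a dictionary so that the order of elements and any repeated keys survive parsing.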
915 7.4. Fetch File: fetch.txt
917 fetch.txt ABNF rules:
919 fetch = 1*fetch-line
920 fetch-line = url 1*WSP length 1*WSP filepath ending
921 url =
922 length = 1*DIGIT / "-"
923 filepath = ("data/"
924 1*( unreserved / pct-encoded / sub-delims ))
925 ending = CR / LF / CRLF
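A fetch-line as described by the rules above has three whitespace-separated fields. The following sketch parses one line; the names FetchEntry and parse_fetch_line are hypothetical, the URL is accepted as any run of non-whitespace characters without further validation, and a length of "-" is mapped to None.

```python
import re
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives "-"
    filepath: str


# url, 1*WSP, length (digits or "-"), 1*WSP, filepath beginning with "data/".
_FETCH_LINE = re.compile(r"^(\S+)[ \t]+([0-9]+|-)[ \t]+(data/.+)$")


def parse_fetch_line(line: str) -> FetchEntry:
    match = _FETCH_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid fetch line: {line!r}")
    url, length, filepath = match.groups()
    return FetchEntry(url, None if length == "-" else int(length), filepath)
```

Keeping the unknown length as None rather than zero makes it harder to confuse "size not stated" with "empty file" in later processing.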
927 8. Contributors
929 Additional contributors to the authoring of BagIt are Andy Boyko,
930 David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate
931 Zwaard.
933 9. Acknowledgements
935 BagIt benefitted from the thoughtful assistance of Stephen Abrams,
936 Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad
937 Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark
938 Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam
939 Turoff, and Jim Tuttle.
941 10. IANA Considerations
943 This draft does not request any action from IANA.
945 11. References
947 11.1. Normative References
949 [cs-registry]
950 IANA, "Character Set Registry", December 2013.
954 [ni-registry]
955 IANA, "Named Information Hash Algorithm Registry", September 2016.
959 [RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
960 DOI 10.17487/RFC1321, April 1992.
963 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
964 Requirement Levels", BCP 14, RFC 2119,
965 DOI 10.17487/RFC2119, March 1997.
968 [RFC3174] Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
969 (SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001.
972 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
973 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
974 2003.
976 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
977 Resource Identifier (URI): Generic Syntax", STD 66,
978 RFC 3986, DOI 10.17487/RFC3986, January 2005.
981 [RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
982 (SHA and SHA-based HMAC and HKDF)", RFC 6234,
983 DOI 10.17487/RFC6234, May 2011.
986 [RFC6920] Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B.,
987 Keranen, A., and P. Hallam-Baker, "Naming Things with
988 Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013.
991 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
992 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
993 May 2017.
995 11.2. Informative References
997 [ENCDEP] Tabata, K., "A Collaboration Model between Archival
998 Systems to Enhance the Reliability of Preservation by an
999 Enclose-and-Deposit Method", 2005.
1002 [LC-CONFORMANCE-SUITE]
1003 The Library of Congress, "BagIt Conformance Suite", 2016-.
1007 [MSFNAM] Microsoft, Inc., "Naming a File", 2008.
1010 [RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
1011 Specifications: ABNF", RFC 4234, DOI 10.17487/RFC4234,
1012 October 2005.
1014 [TN1150] Apple Inc., "Technical Note TN1150: HFS Plus Volume
1015 Format", March 2004.
1019 [UNICODE-TR15]
1020 Unicode Consortium, "Unicode(R) Standard Annex #15:
1021 Unicode Normalization Forms", February 2016.
1024 Authors' Addresses
1025 John A. Kunze
1026 California Digital Library
1027 415 20th St, 4th Floor
1028 Oakland, CA 94612
1029 US
1031 Email: jak@ucop.edu
1033 Justin Littman
1034 Stanford Libraries
1035 518 Memorial Way
1036 Stanford, CA 94305
1037 USA
1039 Email: justinlittman@stanford.edu
1041 Liz Madden
1042 Library of Congress
1043 101 Independence Avenue SE
1044 Washington, DC 20540
1045 USA
1047 Email: emad@loc.gov
1049 John Scancella
1051 Email: john.scancella@gmail.com
1053 Chris Adams
1054 Library of Congress
1055 101 Independence Avenue SE
1056 Washington, DC 20540
1057 USA
1059 Email: cadams@loc.gov