Network Working Group                                        H. Flanagan
Internet-Draft                                                RFC Editor
Intended status: Informational                          January 14, 2015
Expires: July 18, 2015

Digital Preservation Considerations for the RFC Series
draft-flanagan-rfc-preservation-03

Abstract

The RFC Editor is both the publisher and the archivist for the RFC
Series. This document applies specifically to the archivist role of
the RFC Editor.
It provides guidance on when and how to preserve
RFCs, and the tools required to view or re-create RFCs as necessary.
This document also highlights where gaps are in the current process,
and where compromises are suggested to balance cost with ideal best
practice.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current
Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on July 18, 2015.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.

Table of Contents

1. Introduction
   1.1. Terminology
   1.2. Life cycle of Digital Preservation
2. Updating Policy and Procedure
   2.1. Acquisition of Documents
   2.2. Ingest of Documents
   2.3. Metadata and document registration
   2.4. Normalization and standardization of canonical file
        structure and format
      2.4.1. 'Best Effort' data retention
      2.4.2. Single format for archival purposes
      2.4.3. Holistic archiving of the computing environment
   2.5. Transformation/migration to current publication formats
   2.6. System Parameters
   2.7. Financial Planning
3. Recommendations
4. Summary
5. IANA Considerations
6. Security Considerations
7. Draft Change Log
   7.1. -02 to -03
   7.2. -01 to -02
   7.3. -00 to -01
8. Informative References
Author's Address

1. Introduction

The RFC Editor is both the publisher and the archivist for the RFC
Series, a series of technical specifications and policy documents
that includes foundational Internet standards [RFC6635] [RFCSERIES].
As the publisher of these documents, the goal is to produce clear,
consistent, and readable documents for the community using as many
modern features, such as hyperlinks and content markup, within the
document as necessary to convey the information the authors intended
for their audience.
As the archivist, however, the main goal is to
preserve both the information described and the documents themselves
for the indefinite future. To meet both of these goals, the RFC
Editor must find the necessary balance between the publication needs
of today and the archival needs of tomorrow, while acknowledging a
finite set of resources to complete both aspects of the RFC Editor
function.

While many files are created during the publication process, this
document focuses on the archival needs of RFCs and the
Internet-Drafts (I-Ds) that are approved for publication; I-Ds before
they are approved for publication by the appropriate stream-approving
body are out of scope.

To summarize, the key areas of tension between the roles of publisher
and archivist are:

o the desire of the publisher to meet the needs expressed by the
  authors who want to use the latest technology within their
  documents, such as vector graphics, live links, and a rich set of
  metadata;

o the desire of the archivist to support only the simplest format
  for documents possible--currently held by the Series to be
  ASCII-only plain-text--so that the tools needed to view the
  documents are equally simple and resistant to changes in
  technology, resulting in a set of documents that will be easier to
  archive for at least the next several decades if not centuries.

Through most of the history of the RFC Series, the file format for
RFCs has been plain text with an ASCII-only character set. This
choice offered the simplest format likely to remain available to the
largest number of consumers, and the one most likely to be resistant
to changes in technology over time. Increasingly, however, consumers
and authors are requesting additional features that would allow for
easy reading on a wider array of devices and retain all the metadata
an author intended in their document.
In 2013, RFC 6949, "RFC Series
Format Requirements and Future Development," captured the high-level
requirements for the Series; the fundamental issue was that the
plain-text, ASCII-only documents no longer met the needs of the
communities interested in using and producing RFCs [RFC6949].

The assertion that plain-text, ASCII-only documents no longer meet
the needs of the community in turn suggests that the simple archive
process maintained by the RFC Editor is also no longer sufficient.
More complex tools and file formats require a more complex process to
make sure that RFCs can still be read and rendered far into the
future. This document describes the considerations that must inform
any changes in policy and procedure, and describes a model for the
RFC Series to follow when additional formats beyond the ASCII-only,
plain-text RFCs are published. The functional model that provides
the framework for the archival process described in this document was
derived from the ISO Open Archival Information System (OAIS)
Reference Model, defined in "Space data and information transfer
systems - Open archival information system (OAIS) - Reference model"
[ISO14721].

1.1. Terminology

Acquisition: The point at which a document is accepted by the RFC
Editor for future inclusion into the archive.

Ingest: The point at which a digital object is assigned all necessary
metadata to describe the object and its contents, and added to the
archive.

Bit stream preservation: The process of storing and maintaining
digital objects over time, ensuring that there is no loss or
corruption of the bits making up those objects.

Content preservation: The retention of the ability to read, listen,
or watch a digital file in perpetuity. It is not about the bits
being stored; it is about being able to access and present those bits
to the user.

1.2. Life cycle of Digital Preservation

The basic process for preserving digital information has been
described by a variety of organizations. From the Life cycle
Information For E-Literature (LIFE) project in the United Kingdom, to
the ongoing digital preservation work in the U.S. Library of
Congress, the basic digital preservation process is straightforward
[LIFE] [USLOC]. Documents are acquired and processed, metadata is
recorded, physical media is refreshed, and content is regularly
checked to see if it is still accessible by interested parties. The
complexities arise when one considers the need to preserve both the
bits of the digital objects themselves and the tools with which to
express those bits in an environment that experiences rapid changes
in technology.

For most of the existence of the RFC Series, the digital preservation
process has been fairly simple, focusing on bit stream preservation
and relying on paper copies of digital files.

The archival process for the RFC Series is as follows:

1. Acquisition: The RFC Editor database is updated to indicate an
   Internet-Draft (I-D) has been approved for publication. At this
   point, the document is taken through the editorial process on the
   way to publication [RFC-PUB].

2. Ingest: The RFC is added to the archive at the time of
   publication.

3. Metadata creation: The details regarding an RFC, including RFC
   number, author, title, abstract, etc., are created at time of
   publication. Additional metadata in the form of status and
   errata can be added or changed at any time, following the process
   of the originating document stream.

4. Bit stream preservation: This part of the process is handled as
   part of the IT system administration; all servers, disks, and
   backup technology are refreshed on a regular cycle.

5. Content preservation: All RFCs are printed out on paper at time
   of publication, and the electronic files are preserved on disk
   and in backups, with no particular focus on preserving the entire
   computing environment used to create the electronic documents.

When the format for RFCs transitions from plain-text, ASCII-encoded
files to an XML format with multiple outputs, the archival process
overall will become more complex. Additional metadata and some or
possibly all of the computing environment may need to be added to the
archive.

2. Updating Policy and Procedure

RFCs are created and published as digital objects. Unlike
paper-based publications, a digital collection requires a focus on
retaining the details of the technology as well as retaining the
object itself. Specifically, a digital archive needs to:

o consider the inherent instability of digital media;

o plan for a relatively short path to technological obsolescence;

o schedule regular media updates;

o apply predefined criteria for technology evaluation; and,

o ensure the continued authenticity and integrity of RFCs through
  any changes in technology.

As the custodian and canonical source of RFCs and associated errata,
the RFC Editor must consider how to ensure the availability and
integrity of this document series far into the future and determine
whether the focus must be on bit stream preservation, content
preservation, or both.

The RFC Editor has several advantages in acting as the digital
archivist for the Series. Since the RFC Editor is the publisher as
well as the archivist, the RFC Editor controls the format of the
material and the process for adding those materials to an archive,
and can add any additional metadata considered necessary. External
materials, while a major consideration for more general archives, are
no longer accepted by the RFC Editor.
(See "Internet Archaeology:
Documents from Early History" for the list of non-RFC digital objects
held by the RFC Editor [RFC-HISTORY].)

This document describes several different preservation models that
may fit the needs of the Series, and raises several points for
community consideration. Specifically, it covers information on:

o Acquisition of documents

o Ingest of documents

o Metadata and document registration

o Normalization and standardization of canonical file structure and
  format

o Transformation/migration to current publication formats

o Content and computing environment preservation

o System parameters

o Financial impact

2.1. Acquisition of Documents

The acquisition process for documents intended for the archive starts
with the submission of an approved I-D for publication. During the
editorial process, information such as the document metadata is
finalized prior to publication. The initial I-D as submitted and the
RFC produced from it do not formally enter the archive, however,
until the time of publication, which is considered the point of
ingest from an archival perspective.

2.2. Ingest of Documents

Once an RFC is published, the canonical format is considered
immutable. At this point, the RFC Production Center, one of the
internal roles within the RFC Editor, assigns the document metadata
an archivist needs to identify the unique object.

In the case of RFCs, the metadata assigned to a document at the
time of publication includes:

o the RFC number

o ISSN

o publication date

o Digital Object Identifier (DOI) (future)

Additional metadata, such as author name, is assigned earlier in the
document creation process, but it is subject to change up to the
point of publication. More information on metadata is available in
section "Metadata and document registration."
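
As an illustration only, the metadata fixed at the point of ingest
could be modelled as an immutable record. The sketch below is not the
RFC Editor's database schema; the class name, field names, and the
RFC number 9999 are hypothetical, and the DOI is optional because the
text lists it as a future addition.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IngestRecord:
    """Metadata fixed at publication (hypothetical structure)."""
    rfc_number: int
    issn: str
    publication_date: date
    doi: Optional[str] = None  # future metadata, so it may be absent

    def identifier(self) -> str:
        # Human-readable identifier for the archived object.
        return f"RFC {self.rfc_number}"


# Hypothetical example values, chosen only for illustration.
record = IngestRecord(
    rfc_number=9999,
    issn="2070-1721",
    publication_date=date(2015, 1, 14),
)
```

Freezing the record mirrors the statement that the canonical format
is immutable once published; an attempt to reassign a field raises
FrozenInstanceError.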

The publication of an RFC--the point at which responsibility for the
document moves to the RFC Publisher, another internal role within the
RFC Editor--starts the formal archival process for the documents. At
that time, the canonical document should be digitally signed.
Information regarding the signatures and how to verify them must be
made available on the RFC Editor website.

In terms of deciding what to accept in the archive--a major question
for most archives, and yet simple for the RFC Series--the RFC Editor
accepts documents that are approved for publication by the
stream-approving body of one of the document streams: the IETF, IAB,
IRTF, or Independent Submissions streams [RFC5741]. Each document
stream has defined processes on when and how I-Ds are approved and
submitted to the RFC Editor for publication. The RFC Editor does not
select documents for publication and archiving; the RFC Editor edits
and publishes documents as directed by the document streams.

The RFC Editor holds no copyright on I-Ds or RFCs. As per the IETF
Trust Legal Provisions, the copyright for RFCs is held by the authors
and the IETF Trust [TLP]. At any point in time, the current entities
providing RFC Editor services must be able to release the archive of
RFCs to the IETF Trust.

Note: The RFC Editor is currently only responsible for RFCs; any
associated data sets or other research data are not considered within
the RFC Editor's mandate at this time, and therefore the archival
requirements of such data sets are not covered in this document.

2.3. Metadata and document registration

Metadata is data about data. In the field of digital archiving, this
is the data that clearly identifies every aspect of a document, from
its identifier (e.g., the RFC number, the I-D draft string) to the
size and file format of the document and more.
Metadata is stored in
a central registry that records what exactly is being
preserved, where it is located, information on authenticity and
provenance, and details on the hardware and/or software needed to
view or create the documents.

The RFC Editor maintains this registry in the form of a database that
includes all metadata available for documents engaged in the final
editing and publication process. This database feeds the search
engine on the RFC Editor website and the Info Pages available for
every RFC (e.g., http://www.rfc-editor.org/info/rfc####).

Current list of metadata presented in the RFC Info pages

o RFC number

o Canonical URI

o Title

o Status

o Updates

o Authors

o Stream

o Abstract

o Content-Type

o Character Set

o ISSN

o Publication date

Metadata to be added in the future

o Digital Object Identifier (DOI)

o Publication format URIs

Info pages also include links to errata, IPR searches, and plain-text
and XML citation files.

In terms of best practice, all documents used as normative references
within an RFC would also be stored in the archive. While this is
done automatically when the normative reference is another RFC (the
usual case), retaining a copy of third-party documents is considered
out of scope for the RFC Editor. As the digital archive industry
stabilizes, services such as Perma.CC may be a reasonable compromise
[PERMACC]. Those services provide a permanent URI and image capture
of online documents, with a goal of buffering against URI and online
availability changes.

2.4. Normalization and standardization of canonical file structure and
     format

The normalization process is perhaps the most technically critical
part of digital archiving.
The purpose here is content
preservation--making sure the data accepted for archiving are in the
most stable and easily accessed formats possible for the long-term
future, requiring the least amount of re-engineering and emulation of
environments in order to view the document in the future.
Normalization is about enabling long-term access to the information
within a document.

Over the history of the RFC Series, documents have been submitted for
publication in a variety of formats, including paper in the earliest
RFCs. Today, the majority of RFCs are available in both a canonical
plain-text format and PDF format. For exceptions to this list, see
the RFC Online Project [RFC-ONLINE].

Currently, all RFCs are printed out to paper and stored at time of
publication. This has been a reasonable backup plan for several
decades. With few of the features one might expect from a digital
document format (including links, metadata within the document, or
line drawings), plain-text files do not lose much, if any,
information when printed out to paper. As the published formats
change (see RFC 6949), however, printing to paper provides less value
as much of the metadata that is an intrinsic yet invisible part of
the rendered document will be lost in such printing. With that in
mind, the focus needs to shift to preserving the new file formats
electronically.

While each RFC today is printed to paper and all electronic versions
are stored on multiple hard drives, no particular effort is made to
ensure copies of the software used to render or read the canonical
plain-text RFC are also archived.
The RFC Editor has several choices
on how to adapt to a more complex set of data to archive and follow
best practice as defined by the digital archive community:

o a simplified bit stream preservation model that focuses on "best
  effort" standard data retention practices, which rely on backups,
  upgrades, and regular equipment change to preserve the data, and
  assume that emulators may be built when needed if the formats
  used go out of common use (a significant part of the existing
  model);

o a content preservation model that focuses on one publication
  format as a version most likely to be viewable and provide all
  necessary metadata in the future (a viable option considering the
  fact that PDF/A-3--one of the intended publication formats--was
  designed for this type of archiving) [PDF];

o a complex bit stream and content preservation model that focuses
  on archiving the canonical XML and the entire computing
  environment required to create, view and render all outputs from
  that file (the "best practice" when looking at this from an
  archivist's perspective).

Those options are listed in order of least to greatest complexity and
expense. More detail on each option is described below.

2.4.1. 'Best Effort' data retention

When dealing with very simple data structures such as plain-text,
ASCII-only files, the experience of the RFC Series suggests that for
the last few decades, hardware and operating system changes have had
minimal impact on the document files being stored. While a complete
failure of an operating system migration in the past corrupted
the data set, that situation represents a somewhat different problem
than the tools themselves changing such that plain-text files are not
easily read with existing technology.
Given that the basic plain-text
format and ASCII encoding remain in common use, the standard
protections against file corruption and data loss, such as disk
mirroring, off-site backups, and periodic restoration testing, will
continue to provide access to the entirety of the RFC Series for the
foreseeable future. As has been pointed out, both in this document
and in broader community discussion, that is not sufficient when one
moves into more complex formats such as XML, HTML, PDF, or the
proprietary formats offered by today's large IT companies. The risk
of technological change resulting in the file formats mentioned being
deprecated or changed without backwards compatibility is fairly high
when looking at a future of decades or centuries.

It is recommended that this model of archiving the RFC Series cease
to be the primary model after the plain-text, ASCII-only format is no
longer the canonical format. Best effort data retention is a
necessary but not sufficient level of effort for preserving a digital
archive. For more guidance on how to define best effort data
retention, the section on Media and Formats, Summary Recommendations,
in the latest version of the Digital Preservation Handbook provides
useful, concrete information [DPC].

2.4.2. Single format for archival purposes

If one subscribes to the idea that preserving the information
described by a document, rather than the document itself, is the
primary purpose of an archive, then focusing efforts on a single file
format is a reasonable option. Some well-supported archival tooling
projects follow this route, such as Archivematica
(https://www.archivematica.org/wiki/Main_Page). By selecting a
feature-rich yet fundamentally stable file format for documents, an
organization may avoid expensive whole-environment reconstruction in
order to view the document.
The PDF/A formats were designed to be an
archival format for electronic documents, and PDF/A-3 is one of the
options intended for publication as the RFC Series moves from a
plain-text canonical format to an XML canonical format with multiple
publication formats. A PDF/A-3 file can be produced that embeds the
XML from which the PDF/A-3 file was created, which in turn allows for
both original and rendered document validation--if one has the
correct tools available to see the source of the PDF/A-3 file
[I-D.hansen-rfc-use-of-pdf].

When looking at the need to archive RFCs in a resource-limited
environment, a content preservation-only model has merit, but it is
not without risks. First, PDF/A-3 will not be the canonical format,
but is intended to be one of the rendered outputs. It may contain
rendering bugs that were not intended to be in the document. Second,
while the various PDF/A formats were designed to be archival, they
have not been put to the test of time to determine whether they will
actually live up to their design goals.

It is a valid option to consider, but the risks, priorities, and
costs must be discussed by the community before a decision is made to
follow this path. The best option may be to combine this with one of
the other methods of archiving described in this document to help
minimize both risk and cost.

2.4.3. Holistic archiving of the computing environment

Preserving everything published through the RFC Editor in order to
have a permanent record of information, standards, and best practice
is arguably the whole point of being an archival series. One can
argue that it is not only about the information described in an RFC,
it is also about supporting Intellectual Property Rights (IPR) and
retaining the history of the Internet.
In following this model,
however, one must consider the complexity of the archival environment
as matching, and possibly exceeding, the complexity of the file
formats being preserved.

Consider a future where XML has been obsolete for half a century,
HTML5 was a format used three to four human generations ago, and
PDF/A-3 is no longer supported by any existing company's reading
software. In order for RFCs that were produced with XML as their
canonical format to remain readable, an archive must not only hold
the data, it must also hold the entire computing environment that
allows the data to be rendered and viewed. Operating systems and
hardware on which those OSs can run, each major version of each piece
of software used or relied upon during the publication of an RFC, and
browsers and readers for HTML, PDF, and any other publication format
must be preserved in some fashion. This is considered best practice
when archiving digital documents. It is also the most expensive, and
the cost only increases over time as more and more instances of the
computing environment must be preserved over the lifetime of the
Series.

This is a valid option to consider, but the sheer scope of resources
required suggests that this must be discussed by the community before
a decision is made. Pursuing this may require an entirely different
paradigm for the RFC Editor than what has been considered in the
past; expanding the scope and resources for the RFC Editor, finding a
third party to take over the responsibilities of archiving, or some
other option may be necessary.

2.5. Transformation/migration to current publication formats

Noting that normalization is a complex subject, it is important to
consider what to do to mitigate the risk of failure of the
normalization process.

The RFC Editor is responsible for making RFCs available to the
Internet community.
The canonical version of an RFC does not change
once published; any formats officially rendered from the canonical
version, however, may change. One way to mitigate the need to
preserve the entire computing environment for an RFC, including web
browsers and PDF readers, would be to take advantage of the
non-canonical nature of the publication formats and re-render them
from the canonical source at the point that browser or reader
technology has changed sufficiently to make RFCs largely unavailable
to 'modern' tools.

For example, the RFC Editor may develop a practice of conducting an
annual review of the tools needed to view the publication formats
created by the RFC Editor, and determine whether or not the current
common and popular reader technologies (e.g., web browsers, PDF
viewers, e-readers) can view the existing publication formats.

During that review, the RFC Editor would work with the community to
determine if the current publication formats meet the needs of the
community, and whether any should be retired or added to improve the
availability of information to the community at that time.

2.6. System Parameters

While the industry best practice on the backup and restoration of
data is not sufficient as a long-term archival solution, it is still
a necessary part of keeping the Series available now and into the
future. In the past, nearly 800 RFCs had to be manually transcribed
from paper back to electronic format due to a failed server migration
and insufficient backups.

The underlying servers hosting the tools, database, RFCs, and errata
are the physical link in the archive environment. While such systems
cannot and should not remain static and unchanging, there must be
clear documentation regarding the environment, in particular the
storage, backups, and recovery processes for all RFC-related
material.
   The documentation must include information on the refresh cycle for
   the physical storage and backup media and describe a regular cycle of
   data restoration and/or migration testing.

2.7.  Financial Planning

   Having a digital archive policy provides input into the budget
   process.  The main costs associated with digital archives come from
   the complexity and quantity of the material being archived, as
   described in the section on Normalization.  To quote the Digital
   Preservation Coalition's Digital Preservation Handbook [DPC]:

      The complexity of the material submitted and number of objects
      acquired generally has more impact on costs than the total storage
      size.  The type and variety of formats accepted into the
      repository will also affect cost, because for example proprietary
      formats are likely to be more difficult and expensive to manage in
      the long term.  It may be possible to reduce costs by limiting the
      formats the repository will accept, or transforming material into
      a standard common format.  This can be done to reduce the number
      of file types and possibly reducing the storage size.  However, it
      is also necessary to realise that due to storage redundancies
      required for back up each gigabyte of deposited data requires more
      than one gigabyte of disk space in repository storage.
      -- http://www.dpconline.org/advice/preservationhandbook/institutional-strategies/costs-and-business-modelling

   Estimating potential costs and providing figures is outside the scope
   of this document, but it should be noted that costs are a major
   factor when determining what level of archival practice an
   organization will follow.

3.  Recommendations

   Given the need to balance cost and complexity with retention of
   information for historic, legal, and informational purposes,
   preservation efforts should focus on the XML canonical format, the
   PDF/A-3 format, the xml2rfc tool and its documentation, and at least
   one PDF reader application.  All other formats and the overall
   computing environment should be stored under "best effort" data
   retention, which should in turn be described in the appropriate
   vendor contract for the RFC Publisher.

   Particular preservation efforts should be made by:

   o  choosing a format designed for archiving RFCs (PDF/A-3)

   o  embedding the canonical XML format within the PDF/A-3 file for
      RFCs

   o  adding a digital signature and checksum for the canonical XML and
      the PDF/A-3 files

   o  retaining a copy of the plain-text or XML file submitted for
      approved I-Ds

   o  retaining all major versions of the tools and their associated
      documentation used to acquire and ingest an RFC

   o  retaining the final XML file as well as the PDF/A-3 file with the
      embedded XML

   o  retaining at least two software reader applications to ensure the
      PDF/A-3 and XML files can be viewed in the future

   o  partnering with other digital archives around the world to mirror
      copies of the target data

   In order to control costs and focus the archiving effort on the
   entire content of an RFC, including the metadata and other features
   embedded within each RFC published in more than just plain text,
   printing each RFC upon publication to paper is no longer reasonable.
   Proper data storage and mirrored copies of RFCs provide more
   efficient and effective protection against catastrophic failure of
   the existing archive of material.
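
   As an illustration of the checksum recommendation above, the
   following is a minimal sketch, not a description of any existing RFC
   Editor tooling.  It assumes SHA-256 as the checksum algorithm (this
   document does not select one) and uses hypothetical helper names
   (sha256_of, build_manifest, verify_manifest).  It records a checksum
   for every file in an archive directory and later reports any file
   that is missing or has changed, which is one way a mirrored copy
   could be checked against the original.  Digital signatures are not
   covered by this sketch.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> Dict[str, str]:
    """Map each file under archive_dir (by relative path) to its checksum."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(archive_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or whose checksum changed."""
    problems = []
    for name, expected in manifest.items():
        target = archive_dir / name
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(name)
    return problems
```

   A manifest built at publication time could be stored alongside the
   archived files and re-verified against each mirror; an empty result
   from verify_manifest would indicate that every recorded file is
   present and unchanged.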

   Preservation efforts should be reviewed and validated through a
   biannual audit that will verify that the targeted content and all its
   associated metadata can be read with existing tools.  The full
   process from acquisition to ingest should be reviewed to ensure that
   best current practice is being followed from a digital archive
   community perspective.  Since the overall model for the RFC
   Editor-maintained digital archive follows the OAIS Reference model,
   the associated audit guidelines should be followed.  While the RFC
   Editor does not seek to be recognized as 'OAIS-compliant' at this
   time, use of the ISO standard, "Audit and Certification of
   Trustworthy Digital Repositories," would provide a solid, accepted
   method for structuring an audit for this digital archive [ISO16363].

4.  Summary

   The RFC Series is worth archiving.  It contains the history of the
   early Internet, as well as some of the key standards for Internet
   technology and best practice today.  Who knows what the community
   will create in the future?  There are many ways to preserve the
   Series, from relying on preservation of the bits, to focusing on a
   single file format, to preserving the entire computing environment.
   Each possibility, and each combination of them, involves risks and
   requires varying levels of resources.  The goal of this document is
   to describe the possibilities and associated risks so that the
   community can come to an informed decision regarding what they are
   willing to see supported far into the future.

5.  IANA Considerations

   None

6.  Security Considerations

   TBD

7.  Draft Change Log

   To be removed before publication

7.1.  -02 to -03

   Life Cycle of Digital Preservation: modified language to be more
   clear as to when the archival process becomes more complex

   Recommendations: added that the final XML file should be one of the
   items retained in an archive

7.2.  -01 to -02

   Updated text where appropriate to indicate approved I-Ds should also
   be targeted for archiving

7.3.  -00 to -01

   Recommendations: added the requirement to archive reader software,
   and to stop printing out to paper

8.  Informative References

   [I-D.hansen-rfc-use-of-pdf]
              Hansen, T., Masinter, L., and M. Hardy, "PDF for an RFC
              Series Output Document Format",
              draft-hansen-rfc-use-of-pdf-03 (work in progress),
              October 2014.

   [DPC]      Digital Preservation Coalition, "Digital Preservation
              Handbook", 2012.

   [ISO14721]
              International Organization for Standardization, "Space
              data and information transfer systems -- Open archival
              information system (OAIS) -- Reference model", ISO
              14721:2012, 2012.

   [ISO16363]
              International Organization for Standardization, "Space
              data and information transfer systems -- Audit and
              Certification of Trustworthy Digital Repositories", ISO
              16363:2011, 2011.

   [LIFE]     Hole, B., "LIFE^3: Predictive Costing of Digital
              Preservation", July 2010.

   [PDF]      International Organization for Standardization,
              "Electronic document file format for long-term
              preservation -- Part 3: Use of ISO 32000-1 with support
              for embedded files (PDF/A-3)", ISO 19005-3, 2012.

   [PERMACC]  "Perma.CC", n.d.

   [RFC-HISTORY]
              RFC Editor, "Internet Archaeology: Documents from Early
              History", n.d.

   [RFC-ONLINE]
              RFC Editor, "History of RFC Online Project", n.d.

   [RFC-PUB]  RFC Editor, "RFC Editor Publication Process", n.d.

   [RFCSERIES]
              RFC Editor, "Overview of RFC Document Series", n.d.

   [TLP]      IETF Trust, "IETF Trust Legal Provisions", n.d.

   [USLOC]    Library of Congress, "Life Cycle Models for Digital
              Stewardship", n.d.

   [RFC5741]  Daigle, L., Kolkman, O., and IAB, "RFC Streams, Headers,
              and Boilerplates", RFC 5741, December 2009.

   [RFC6635]  Kolkman, O., Halpern, J., and IAB, "RFC Editor Model
              (Version 2)", RFC 6635, June 2012.

   [RFC6949]  Flanagan, H. and N. Brownlee, "RFC Series Format
              Requirements and Future Development", RFC 6949, May 2013.

Author's Address

   Heather Flanagan
   RFC Editor

   Email: rse@rfc-editor.org