Network Working Group                                        H. Flanagan
Internet-Draft                                                RFC Editor
Intended status: Informational                          December 2, 2014
Expires: June 5, 2015


         Digital Preservation Considerations for the RFC Series
                   draft-flanagan-rfc-preservation-02

Abstract

   The RFC Editor is both the publisher and the archivist for the RFC
   Series. This document applies specifically to the archivist role of
   the RFC Editor. It provides guidance on when and how to preserve
   RFCs, and the tools required to view or re-create RFCs as necessary.
   This document also highlights where gaps exist in the current
   process, and where compromises are suggested to balance cost with
   ideal best practice.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 5, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Life cycle of Digital Preservation
   2.  Updating Policy and Procedure
     2.1.  Acquisition of Documents
     2.2.  Ingest of Documents
     2.3.
           Metadata and document registration
     2.4.  Normalization and standardization of canonical file
           structure and format
       2.4.1.  'Best Effort' data retention
       2.4.2.  Single format for archival purposes
       2.4.3.  Holistic archiving of the computing environment
     2.5.  Transformation/migration to current publication formats
     2.6.  System Parameters
     2.7.  Financial Planning
   3.  Recommendations
   4.  Summary
   5.  IANA Considerations
   6.  Security Considerations
   7.  Draft Change Log
     7.1.  -01 to -02
     7.2.  -00 to -01
   8.  Informative References
   Author's Address

1.  Introduction

   The RFC Editor is both the publisher and the archivist for the RFC
   Series, a series of technical specifications and policy documents
   that includes foundational Internet standards [RFC6635] [RFCSERIES].
   As the publisher of these documents, the goal is to produce clear,
   consistent, and readable documents for the community using as many
   modern features, such as hyperlinks and content markup, within the
   document as necessary to convey the information the authors intended
   for their audience. As the archivist, however, the main goal is to
   preserve both the information described and the documents themselves
   for the indefinite future.
   To meet both of these goals, the RFC Editor must find the necessary
   balance between the publication needs of today and the archival
   needs of tomorrow, while acknowledging a finite set of resources to
   complete both aspects of the RFC Editor function.

   While many files are created during the publication process, this
   document focuses on the archival needs of RFCs and the Internet-
   Drafts (I-Ds) that are approved for publication; I-Ds before they are
   approved for publication by the appropriate stream-approving body are
   out of scope.

   To summarize, the key areas of tension between the roles of publisher
   and archivist are:

   o  the desire of the publisher to meet the needs expressed by the
      authors who want to use the latest technology within their
      documents, such as vector graphics, live links, and a rich set of
      metadata;

   o  the desire of the archivist to support only the simplest format
      for documents possible--currently held by the Series to be ASCII-
      only plain-text--so that the tools needed to view the documents
      are equally simple and resistant to changes in technology,
      resulting in a set of documents that will be easier to archive for
      at least the next several decades if not centuries.

   Through most of the history of the RFC Series, the file format for
   RFCs has been plain text with an ASCII-only character set. This
   choice offered the simplest format likely to remain available to the
   largest number of consumers, and the one most likely to be resistant
   to changes in technology over time. Increasingly, however, consumers
   and authors are requesting additional features that would allow for
   easy reading on a wider array of devices and retain all the metadata
   an author intended in their document.
   In 2013, RFC 6949, "RFC Series Format Requirements and Future
   Development," captured the high-level requirements for the Series;
   the fundamental issue being that the plain-text, ASCII-only documents
   no longer met the needs of the communities interested in using and
   producing RFCs [RFC6949].

   The assertion that plain-text, ASCII-only documents no longer meet
   the needs of the community in turn suggests that the simple archive
   process maintained by the RFC Editor is also no longer sufficient.
   More complex tools and file formats require a more complex process to
   make sure that RFCs can still be read and rendered far into the
   future. This document describes the considerations that must inform
   any changes in policy and procedure, and describes a model for the
   RFC Series to follow when additional formats beyond the ASCII-only,
   plain-text RFCs are published. The functional model that provides
   the framework for the archival process described in this document was
   derived from the ISO Open Archival Information System (OAIS)
   Reference Model, defined in "Space data and information transfer
   systems - Open archival information system (OAIS) - Reference model"
   [ISO14721].

1.1.  Terminology

   Acquisition: The point at which a document is accepted by the RFC
   Editor for future inclusion into the archive.

   Ingest: The point at which a digital object is assigned all necessary
   metadata to describe the object and its contents, and added to the
   archive.

   Bit stream preservation: The process of storing and maintaining
   digital objects over time, ensuring that there is no loss or
   corruption of the bits making up those objects.

   Content preservation: The retention of the ability to read, listen
   to, or watch a digital file in perpetuity. It is not about the bits
   being stored; it is about being able to access and present those bits
   to the user.

1.2.
      Life cycle of Digital Preservation

   The basic process for preserving digital information has been
   described by a variety of organizations. From the Life cycle
   Information For E-Literature (LIFE) project in the United Kingdom, to
   the ongoing digital preservation work in the U.S. Library of
   Congress, the basic digital preservation process is straightforward
   [LIFE] [USLOC]. Documents are acquired and processed, metadata is
   recorded, physical media is refreshed, and content is regularly
   checked to see if it is still accessible by interested parties. The
   complexities arise when one considers the need to preserve both the
   bits of the digital objects themselves and the tools with which to
   express those bits in an environment that experiences rapid changes
   in technology.

   For most of the existence of the RFC Series, the digital preservation
   process has been fairly simple, focusing on bit stream preservation
   and relying on paper copies of digital files.

   The archival process for the RFC Series is as follows:

   1.  Acquisition: The RFC Editor database is updated to indicate an
       Internet-Draft (I-D) has been approved for publication. At this
       point, the document is taken through the editorial process on the
       way to publication [RFC-PUB].

   2.  Ingest: The RFC is added to the archive at the time of
       publication.

   3.  Metadata creation: The details regarding an RFC, including RFC
       number, author, title, abstract, etc., are created at time of
       publication. Additional metadata in the form of status and
       errata can be added or changed at any time, following the process
       of the originating document stream.

   4.  Bit stream preservation: This part of the process is handled as
       part of the IT system administration; all servers, disks, and
       backup technology are refreshed on a regular cycle.

   5.
       Content preservation: All RFCs are printed out on paper at time
       of publication, and the electronic files are preserved on disk
       and in backups with no particular focus on preserving the entire
       computing environment used to create the electronic documents.

   When the format for RFCs changes from plain-text, ASCII-encoded
   files, the archival process overall will become more complex.
   Additional metadata and some or possibly all of the computing
   environment may need to be added to the archive.

2.  Updating Policy and Procedure

   RFCs are created and published as digital objects. Unlike paper-
   based publications, a digital collection requires a focus on
   retaining the details of the technology as well as retaining the
   object itself. Specifically, a digital archive needs to:

   o  consider the inherent instability of digital media;

   o  plan for a relatively short path to technological obsolescence;

   o  schedule regular media updates;

   o  apply predefined criteria for technology evaluation; and,

   o  ensure the continued authenticity and integrity of RFCs through
      any changes in technology.

   As the custodian and canonical source of RFCs and associated errata,
   the RFC Editor must consider how to ensure the availability and
   integrity of this document series far into the future and determine
   whether the focus must be on bit stream preservation, content
   preservation, or both.

   The RFC Editor has several advantages in acting as the digital
   archivist for the Series. Since the RFC Editor is the publisher as
   well as the archivist, the RFC Editor controls the format of the
   material and the process for adding those materials to an archive,
   and can add any additional metadata considered necessary. External
   materials, while a major consideration for more general archives, are
   no longer accepted by the RFC Editor.
   (See "Internet Archaeology: Documents from Early History" for the
   list of non-RFC digital objects held by the RFC Editor
   [RFC-HISTORY].)

   This document describes several different preservation models that
   may fit the needs of the Series, and raises several points for
   community consideration. Specifically, it covers information on:

   o  Acquisition of documents

   o  Ingest of documents

   o  Metadata and document registration

   o  Normalization and standardization of canonical file structure and
      format

   o  Transformation/migration to current publication formats

   o  Content and computing environment preservation

   o  System parameters

   o  Financial impact

2.1.  Acquisition of Documents

   The acquisition process for documents intended for the archive starts
   with the submission of an approved I-D for publication. During the
   editorial process, information such as the document metadata is
   finalized prior to publication. The initial I-D as submitted and the
   RFC produced from it do not formally enter the archive, however,
   until the time of publication, which is considered the point of
   ingest from an archival perspective.

2.2.  Ingest of Documents

   Once an RFC is published, the canonical format is considered
   immutable. At this point, the RFC Production Center, one of the
   internal roles within the RFC Editor, assigns the document metadata
   an archivist needs to identify the unique object.

   In the case of RFCs, the metadata assigned to a document at the time
   of publication includes:

   o  the RFC number

   o  ISSN

   o  publication date

   o  Digital Object Identifier (DOI) (future)

   Additional metadata, such as author name, is assigned earlier in the
   document creation process, but it is subject to change up to the
   point of publication. More information on metadata is available in
   section "Metadata and document registration."
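
   As a minimal sketch of the ingest step described above, the following
   Python fragment models the metadata fixed at publication as an
   immutable record. The IngestRecord class, the ingest() helper, and
   all field names are hypothetical and are not part of any RFC Editor
   tooling. The RFC number, date, and author are placeholders; the ISSN
   shown is the one registered for the RFC Series.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record of the metadata fixed at publication (ingest)."""
    rfc_number: int
    issn: str
    publication_date: date
    authors: Tuple[str, ...]   # may change before publication, fixed after
    doi: Optional[str] = None  # listed in the text as a future field


def ingest(rfc_number: int, issn: str, publication_date: date,
           authors: Tuple[str, ...],
           doi: Optional[str] = None) -> IngestRecord:
    """Validate the fields and return an immutable record."""
    if rfc_number <= 0:
        raise ValueError("RFC number must be a positive integer")
    if not authors:
        raise ValueError("at least one author is required")
    return IngestRecord(rfc_number, issn, publication_date,
                        tuple(authors), doi)


# Placeholder values for illustration only.
record = ingest(9999, "2070-1721", date(2015, 1, 1), ("A. Author",))
```

   A frozen dataclass is used here only to mirror the statement that
   metadata such as author name may change up to the point of
   publication but not afterward; an actual registry is more likely to
   be a database, as the section on metadata describes.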

   The publication of an RFC--the point at which responsibility for the
   document moves to the RFC Publisher, another internal role within the
   RFC Editor--starts the formal archival process for the documents. At
   that time, the canonical document should be digitally signed.
   Information regarding the signatures and how to verify them must be
   made available on the RFC Editor website.

   In terms of deciding what to accept in the archive--a major question
   for most archives, and yet simple for the RFC Series--the RFC Editor
   accepts documents that are approved for publication by the stream-
   approving body of one of the document streams: the IETF, IAB, IRTF,
   or Independent Submissions streams [RFC5741]. Each document stream
   has defined processes on when and how I-Ds are approved and submitted
   to the RFC Editor for publication. The RFC Editor does not select
   documents for publication and archiving; the RFC Editor edits and
   publishes documents as directed by the document streams.

   The RFC Editor holds no copyright on I-Ds or RFCs. As per the IETF
   Trust Legal Provisions, the copyright for RFCs is held by the authors
   and the IETF Trust [TLP]. At any point in time, the current entities
   providing RFC Editor services must be able to release the archive of
   RFCs to the IETF Trust.

   Note: The RFC Editor is currently only responsible for RFCs; any
   associated data sets or other research data are not considered within
   the RFC Editor's mandate at this time, and therefore the archival
   requirements of such data sets are not covered in this document.

2.3.  Metadata and document registration

   Metadata is data about data. In the field of digital archiving, this
   is the data that clearly identifies every aspect of a document, from
   its identifier (i.e., the RFC number, the I-D draft string) to the
   size and file format of the document and more.
   Metadata is stored in a central registry that records exactly what
   is being preserved, where it is located, information on authenticity
   and provenance, and details on the hardware and/or software needed to
   view or create the documents.

   The RFC Editor maintains this registry in the form of a database that
   includes all metadata available for documents engaged in the final
   editing and publication process. This database feeds the search
   engine on the RFC Editor website and the Info Pages available for
   every RFC (e.g., http://www.rfc-editor.org/info/rfc####).

   Current list of metadata presented in the RFC Info pages:

   o  RFC number

   o  Canonical URI

   o  Title

   o  Status

   o  Updates

   o  Authors

   o  Stream

   o  Abstract

   o  Content-Type

   o  Character Set

   o  ISSN

   o  Publication date

   Metadata to be added in the future:

   o  Digital Object Identifier (DOI)

   o  Publication format URIs

   Info pages also include links to: errata, IPR searches, plain text
   and XML citation files.

   In terms of best practice, all documents used as normative references
   within an RFC would also be stored in the archive. While this is
   done automatically when the normative reference is another RFC (the
   usual case), retaining a copy of third-party documents is considered
   out of scope for the RFC Editor. As the digital archive industry
   stabilizes, services such as Perma.CC may be a reasonable compromise
   [PERMACC]. Those services provide a permanent URI and image capture
   of online documents, with a goal of buffering against URI and online
   availability changes.

2.4.  Normalization and standardization of canonical file structure and
      format

   The normalization process is perhaps the most technically critical
   part of digital archiving.
   The purpose here is content preservation--making sure the data
   accepted for archiving are in the most stable and easily accessed
   formats possible for the long-term future, requiring the least amount
   of re-engineering and emulation of environments in order to view the
   document in the future. Normalization is about enabling long-term
   access to the information within a document.

   Over the history of the RFC Series, documents have been submitted for
   publication in a variety of formats, including paper in the earliest
   RFCs. Today, the majority of RFCs are available in both a canonical
   plain-text format and PDF format. For exceptions, see the RFC Online
   Project [RFC-ONLINE].

   Currently, all RFCs are printed out to paper and stored at time of
   publication. This has been a reasonable backup plan for several
   decades. With few of the features one might expect from a digital
   document format (including links, metadata within the document, or
   line drawings), plain-text files do not lose much, if any,
   information when printed out to paper. As the published formats
   change (see RFC 6949), however, printing to paper provides less value
   as much of the metadata that is an intrinsic yet invisible part of
   the rendered document will be lost in such printing. With that in
   mind, the focus needs to shift to preserving the new file formats
   electronically.

   While each RFC today is printed to paper and all electronic versions
   stored on multiple hard drives, no particular effort is made to
   ensure copies of the software used to render or read the canonical
   plain-text RFC are also archived.
   The RFC Editor has several choices on how to adapt to a more complex
   set of data to archive and follow best practice as defined by the
   digital archive community:

   o  a simplified bit stream preservation model that focuses on "best
      effort" standard data retention practices, which rely on backups,
      upgrades, and regular equipment change to preserve the data, and
      that assumes emulators may be built when needed if the formats
      used go out of common use (a significant part of the existing
      model);

   o  a content preservation model that focuses on one publication
      format as a version most likely to be viewable and provide all
      necessary metadata in the future (a viable option considering the
      fact that PDF/A-3--one of the intended publication formats--was
      designed for this type of archiving) [PDF];

   o  a complex bit stream and content preservation model that focuses
      on archiving the canonical XML and the entire computing
      environment required to create, view and render all outputs from
      that file (the "best practice" when looking at this from an
      archivist's perspective).

   Those options are listed in order of least to greatest complexity and
   expense. More detail on each option is described below.

2.4.1.  'Best Effort' data retention

   When dealing with very simple data structures such as plain-text,
   ASCII-only files, the experience of the RFC Series suggests that for
   the last few decades, hardware and operating system changes have had
   minimal impact on the document files being stored. While a complete
   failure of an operating system migration in the past corrupted the
   data set, that situation represents a somewhat different problem
   than the tools themselves changing such that plain-text files are not
   easily read with existing technology.
   Given that the basic plain-text format and ASCII encoding remain in
   common use, the standard protections against file corruption and data
   loss, such as disk mirroring, off-site backups, and periodic
   restoration testing, will continue to provide access to the entirety
   of the RFC Series for the foreseeable future. As has been pointed
   out, both in this document and in broader community discussion, that
   is not sufficient when one moves into more complex formats such as
   XML, HTML, PDF, or other proprietary formats offered by today's large
   IT companies. The risk of technological change resulting in the file
   formats mentioned being deprecated or changed without backwards
   compatibility is fairly high when looking at a future of decades or
   centuries.

   It is recommended that this model of archiving the RFC Series cease
   to be the primary model after the plain-text, ASCII-only format is no
   longer the canonical format. Best effort data retention is a
   necessary but not sufficient level of effort for preserving a digital
   archive. For more guidance on how to define best effort data
   retention, the section on Media and Formats, Summary Recommendations,
   in the latest version of the Digital Preservation Handbook provides
   useful, concrete information [DPC].

2.4.2.  Single format for archival purposes

   If one subscribes to the idea that preserving the information
   described by a document, rather than the document itself, is the
   primary purpose of an archive, then focusing efforts on a single file
   format is a reasonable option. Some well-supported archival tooling
   projects follow this route, such as Archivematica
   (https://www.archivematica.org/wiki/Main_Page). By selecting a
   feature-rich yet fundamentally stable file format for documents, an
   organization may avoid expensive whole-environment reconstruction in
   order to view the document.
   The PDF/A formats were designed to be an archival format for
   electronic documents, and PDF/A-3 is one of the options intended for
   publication as the RFC Series moves from a plain-text canonical
   format to an XML canonical format with multiple publication formats.
   A PDF/A-3 file can be produced that embeds the XML from which the
   PDF/A-3 file was created, which in turn allows for both original and
   rendered document validation--if one has the correct tools available
   to see the source of the PDF/A-3 file [draft-hansen-rfc-use-of-pdf].

   When looking at the need to archive RFCs in a resource-limited
   environment, a content preservation-only model has merit, but it is
   not without risks. First, PDF/A-3 will not be the canonical format,
   but is intended to be one of the rendered outputs. It may contain
   rendering bugs that were not intended to be in the document. Second,
   while the various PDF/A formats were designed to be archival, they
   have not been put to the test of time to determine whether they will
   actually live up to their design goals.

   It is a valid option to consider, but the risks, priorities, and
   costs must be discussed by the community before a decision is made to
   follow this path. The best option may be to combine this with one of
   the other methods of archiving described in this document to help
   minimize both risk and cost.

2.4.3.  Holistic archiving of the computing environment

   Preserving everything published through the RFC Editor in order to
   have a permanent record of information, standards, and best practice,
   is arguably the whole point of being an archival series. One can
   argue that it is not only about the information described in an RFC;
   it is also about supporting Intellectual Property Rights (IPR) and
   retaining the history of the Internet.
   In following this model, however, one must consider the complexity
   of the archival environment as matching, and possibly exceeding, the
   complexity of the file formats being preserved.

   Consider a future where XML has been obsoleted for half a century,
   HTML5 was a format used three to four human generations ago, and PDF/
   A-3 is no longer supported by any existing company's reading
   software. In order for RFCs that were produced with XML as their
   canonical format to remain readable, an archive must not only hold
   the data, it must also hold the entire computing environment that
   allows the data to be rendered and viewed. Operating systems and
   hardware on which those OSs can run, each major version of each piece
   of software used or relied upon during the publication of an RFC,
   browsers and readers for HTML, PDF, and any other publication format,
   must be preserved in some fashion. This is considered best practice
   when archiving digital documents. It is also the most expensive, and
   the cost only increases over time as more and more instances of the
   computing environment must be preserved over the lifetime of the
   Series.

   This is a valid option to consider, but the sheer scope of resources
   required suggests that this must be discussed by the community before
   a decision is made. Pursuing this may require an entirely different
   paradigm for the RFC Editor than what has been considered in the
   past; expanding the scope and resources for the RFC Editor, finding a
   third party to take over the responsibilities of archiving, or some
   other option may be necessary.

2.5.  Transformation/migration to current publication formats

   Because normalization is a complex subject, it is important to
   consider what to do to mitigate the risk of failure of the
   normalization process.

   The RFC Editor is responsible for making RFCs available to the
   Internet community.
   The canonical version of an RFC does not change once published; any
   formats officially rendered from the canonical version, however, may
   change. One way to mitigate the need to preserve the entire
   computing environment for an RFC, including web browsers and PDF
   readers, would be to take advantage of the non-canonical nature of
   the publication formats and re-render them from the canonical source
   at the point that browser or reader technology has changed
   sufficiently to make RFCs largely unavailable to 'modern' tools.

   For example, the RFC Editor may adopt a practice of conducting an
   annual review of the tools needed to view the publication formats
   created by the RFC Editor, to determine whether or not the current
   common and popular reader technologies (e.g., web browsers, PDF
   viewers, e-readers) can view the existing publication formats.
   During that review, the RFC Editor would work with the community to
   determine if the current publication formats meet the needs of the
   community, and whether any should be retired or added to improve the
   availability of information to the community at that time.

2.6.  System Parameters

   While the industry best practice on the backup and restoration of
   data is not sufficient as a long-term archival solution, it is still
   a necessary part of keeping the Series available now and into the
   future. In the past, nearly 800 RFCs had to be manually transcribed
   from paper back to electronic format due to a failed server migration
   and insufficient backups.

   The underlying servers hosting the tools, database, RFCs, and errata
   are the physical link in the archive environment. While such systems
   cannot and should not remain static and unchanging, there must be
   clear documentation regarding the environment, in particular the
   storage, backups, and recovery processes for all RFC-related
   material.
   The documentation must include information on the refresh cycle for
   the physical storage and backup media and describe a regular cycle of
   data restoration and/or migration testing.

2.7.  Financial Planning

   Having a digital archive policy provides input into the budget
   process. The main costs associated with digital archives come from
   the complexity and quantity of the material being archived, as
   described in the section on Normalization. To quote the Digital
   Preservation Coalition's Digital Preservation Handbook:

      The complexity of the material submitted and number of objects
      acquired generally has more impact on costs than the total storage
      size. The type and variety of formats accepted into the
      repository will also affect cost, because for example proprietary
      formats are likely to be more difficult and expensive to manage in
      the long term. It may be possible to reduce costs by limiting the
      formats the repository will accept, or transforming material into
      a standard common format. This can be done to reduce the number
      of file types and possibly reducing the storage size. However, it
      is also necessary to realise that due to storage redundancies
      required for back up each gigabyte of deposited data requires more
      than one gigabyte of disk space in repository storage. --
      http://www.dpconline.org/advice/preservationhandbook/
      institutional-strategies/costs-and-business-modelling

   Estimating potential costs and providing figures is outside of the
   scope of this document, but it should be noted that costs are a major
   factor when determining what level of archival practice an
   organization will follow.

3.
Recommendations 625 Given the need to balance cost and complexity with retention of 626 information for historic, legal, and informational purposes, 627 preservation efforts should focus on the XML canonical format, the 628 PDF/A-3 format, the xml2rfc tool and its documentation, and at least 629 one PDF reader application. All other formats and the overall 630 computing environment should be stored as described in "best effort" 631 data retention, which should in turn be described in the appropriate 632 vendor contract for the RFC Publisher. 634 Particular preservation efforts should be made by: 636 o choosing a format designed for archiving RFCs (PDF/A-3) 638 o embedding the canonical XML format within the PDF/A-3 file for 639 RFCs 641 o adding a digital signature and checksum for the canonical XML and 642 the PDF/A-3 files 644 o retaining a copy of the plain-text or XML file submitted for 645 approved I-Ds 647 o retaining all major versions of the tools and their associated 648 documentation used to acquire and ingest an RFC 650 o retaining at least two software reader applications to ensure the 651 PDF/A-3 and XML files can be viewed in the future 653 o partnering with other digital archives around the world to mirror 654 copies of the target data 656 In order to control costs and focus the archiving effort on the 657 entire content of an RFC, including the metadata and other features 658 embedded within each RFC published in more than just plain text, 659 printing each RFC upon publication to paper is no longer reasonable. 660 Proper data storage and mirrored copies of RFCs provides more 661 efficient and effective copies in case of catastrophic failure of the 662 existing archive of material. 664 Preservation efforts should be reviewed and validated through a bi- 665 annual audit that will verify that the targeted content and all its 666 associated metadata can be read with existing tools. 
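
   As an illustration only, the checksum recommendation above could be
   exercised during such an audit roughly as sketched below.  This is
   not a procedure defined by the RFC Editor: the choice of SHA-256,
   the idea of a stored manifest, and the function names sha256_of,
   build_manifest, and audit are all assumptions made for the example.
   Digital signatures and readability checks are not covered by this
   sketch.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Map each .xml and .pdf file name in archive_dir to its digest.

    Hypothetical helper: a real archive would decide for itself where
    and how the manifest is stored.
    """
    return {
        p.name: sha256_of(p)
        for p in sorted(archive_dir.iterdir())
        if p.suffix in {".xml", ".pdf"}
    }


def audit(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of files that are missing or whose digest changed."""
    failures = []
    for name, expected in manifest.items():
        target = archive_dir / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

   Under these assumptions, a manifest would be built once at ingest,
   and each audit would call audit() against the primary archive and
   each mirror; any name returned indicates a file to restore from
   another copy.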
   The full process from acquisition to ingest should be reviewed to
   ensure that best current practice is being followed from a digital
   archive community perspective.  Since the overall model for the RFC
   Editor-maintained digital archive follows the OAIS Reference Model,
   the associated audit guidelines should be followed.  While the RFC
   Editor does not seek to be recognized as 'OAIS-compliant' at this
   time, use of the ISO standard, "Audit and Certification of
   Trustworthy Digital Repositories," would provide a solid, accepted
   method for structuring an audit for this digital archive [ISO16363].

4.  Summary

   The RFC Series is worth archiving.  It contains the history of the
   early Internet, as well as some of the key standards for Internet
   technology and best practice today.  Who knows what the community
   will create in the future?  There are many ways to preserve the
   Series, from relying on preservation of the bits, to focusing on a
   single file format, to preserving the entire computing environment.
   Each possibility, and each combination of them, involves risks and
   requires varying levels of resources.  The goal of this document is
   to describe the possibilities and associated risks so that the
   community can come to an informed decision regarding what it is
   willing to see supported far into the future.

5.  IANA Considerations

   None

6.  Security Considerations

   TBD

7.  Draft Change Log

   To be removed before publication

7.1.  -01 to -02

   Updated text where appropriate to indicate approved I-Ds should also
   be targeted for archiving

7.2.  -00 to -01

   Recommendations: added the requirement to archive reader software,
   and to stop printing out to paper

8.  Informative References

   [draft-hansen-rfc-use-of-pdf]
              Hansen, T., Masinter, L., and M. Hardy, "PDF for an RFC
              Series Output Document Format",
              draft-hansen-rfc-use-of-pdf-02, July 2014.

   [DPC]      Digital Preservation Coalition, "Digital Preservation
              Handbook", 2012, .

   [ISO14721]
              International Organization for Standardization, "Space
              data and information transfer systems -- Open archival
              information system (OAIS) -- Reference model", ISO
              14721:2012, 2012.

   [ISO16363]
              International Organization for Standardization, "Space
              data and information transfer systems -- Audit and
              Certification of Trustworthy Digital Repositories", ISO
              16363:2011, 2011.

   [LIFE]     Hole, B., "LIFE^3: Predictive Costing of Digital
              Preservation", July 2010, .

   [PDF]      International Organization for Standardization,
              "Electronic document file format for long-term
              preservation -- Part 3: Use of ISO 32000-1 with support
              for embedded files (PDF/A-3)", ISO 19005-3, 2012.

   [PERMACC]  "Perma.CC", n.d., .

   [RFC-HISTORY]
              RFC Editor, "Internet Archaeology: Documents from Early
              History", n.d., .

   [RFC-ONLINE]
              RFC Editor, "History of RFC Online Project", n.d., .

   [RFC-PUB]  RFC Editor, "RFC Editor Publication Process", n.d., .

   [RFCSERIES]
              RFC Editor, "Overview of RFC Document Series", n.d., .

   [TLP]      IETF Trust, "IETF Trust Legal Provisions", n.d., .

   [USLOC]    Library of Congress, "Life Cycle Models for Digital
              Stewardship", n.d., .

   [RFC5741]  Daigle, L., Kolkman, O., and IAB, "RFC Streams, Headers,
              and Boilerplates", RFC 5741, December 2009.

   [RFC6635]  Kolkman, O., Halpern, J., and IAB, "RFC Editor Model
              (Version 2)", RFC 6635, June 2012.

   [RFC6949]  Flanagan, H. and N. Brownlee, "RFC Series Format
              Requirements and Future Development", RFC 6949, May 2013.

Author's Address

   Heather Flanagan
   RFC Editor

   Email: rse@rfc-editor.org