Network Working Group                                        H. Flanagan
Internet-Draft                                                RFC Editor
Intended status: Informational                         February 28, 2017
Expires: September 1, 2017

         Digital Preservation Considerations for the RFC Series
                     draft-iab-rfc-preservation-04

Abstract

   The RFC Editor is both the publisher and the archivist for the RFC
   Series. This document applies specifically to the archivist role of
   the RFC Editor. It provides guidance on when and how to preserve
   RFCs, and the tools required to view or re-create RFCs as necessary.
   This document also highlights where gaps are in the current process,
   and where compromises are suggested to balance cost with ideal best
   practice.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 1, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Terminology
     1.2. Life Cycle of Digital Preservation
   2. Updating Policy and Procedure
     2.1. Acquisition of Documents
     2.2. Ingestion of Documents
     2.3. Metadata and document registration
     2.4. Normalization and standardization of canonical file
          structure and format
       2.4.1. 'Best Effort' data retention
       2.4.2. Single format for archival purposes
       2.4.3. Holistic archiving of the computing environment
     2.5. Transformation/migration to current publication formats
     2.6. System Parameters
     2.7. Financial Planning
   3. Recommendations
   4. Summary
   5. IANA Considerations
   6. Security Considerations
   7. Informative References
   Author's Address

1. Introduction

   The RFC Editor is both the publisher and the archivist for the RFC
   Series, a series of technical specifications and policy documents
   that includes foundational Internet standards [RFC6635] [RFCSERIES].
   As the publisher of these documents, the goal is to produce clear,
   consistent, and readable documents for the community using as many
   modern features, such as hyperlinks and content markup, within the
   document as necessary to convey the information the authors intended
   for their audience. As the archivist, however, the main goal is to
   preserve both the information described and the documents themselves
   for the indefinite future. To meet both of these goals, the RFC
   Editor must find the necessary balance between the publication needs
   of today and the archival needs of tomorrow, while acknowledging a
   finite set of resources to complete both aspects of the RFC Editor
   function.

   While many files are created during the publication process, this
   document focuses on the archival needs of RFCs and the
   Internet-Drafts (I-Ds) that are approved for publication; I-Ds before
   they are approved for publication by the appropriate stream-approving
   body are out of scope.

   To summarize, the key areas of tension between the roles of publisher
   and archivist are:

   o  the desire of the publisher to meet the needs expressed by the
      authors who want to use the latest technology within their
      documents, such as vector graphics, live links, and a rich set of
      metadata;

   o  the desire of the archivist to support only the simplest format
      for documents possible--currently held by the Series to be
      ASCII-only plain-text--so that the tools needed to view the
      documents are equally simple and resistant to changes in
      technology, resulting in a set of documents that will be easier to
      archive for at least the next several decades if not centuries.

   Through most of the history of the RFC Series, the file format for
   RFCs has been plain text with an ASCII-only character set. This
   choice offered the simplest format likely to remain available to the
   largest number of consumers, and the one most likely to be resistant
   to changes in technology over time. Increasingly, however, consumers
   and authors are requesting additional features that would allow for
   easy reading on a wider array of devices and retain all the metadata
   an author intended in their document. In 2013, RFC 6949, "RFC Series
   Format Requirements and Future Development," captured the high-level
   requirements for the Series; the fundamental issue was that
   plain-text, ASCII-only documents no longer met the needs of the
   communities interested in using and producing RFCs [RFC6949].

   The assertion that plain-text, ASCII-only documents no longer meet
   the needs of the community in turn suggests that the simple archive
   process maintained by the RFC Editor is also no longer sufficient.
   More complex tools and file formats require a more complex process to
   make sure that RFCs can still be read and rendered far into the
   future. This document describes the considerations that must inform
   any changes in policy and procedure, and describes a model for the
   RFC Series to follow when additional formats beyond the ASCII-only,
   plain-text RFCs are published. The functional model that provides
   the framework for the archival process described in this document was
   derived from the ISO Open Archival Information System (OAIS)
   Reference Model, defined in "Space data and information transfer
   systems - Open archival information system (OAIS) - Reference model"
   [ISO14721].

1.1. Terminology

   Acquisition: The point at which a document is accepted by the RFC
   Editor for future inclusion into the archive.

   Ingest: The point at which a digital object is assigned all necessary
   metadata to describe the object and its contents, and added to the
   archive.

   Bit stream preservation: The process of storing and maintaining
   digital objects over time, ensuring that there is no loss or
   corruption of the bits making up those objects.

   Content preservation: The retention of the ability to read, listen
   to, or watch a digital file in perpetuity. It is not about the bits
   being stored; it is about being able to access and present those bits
   to the user.

1.2. Life Cycle of Digital Preservation

   The basic process for preserving digital information has been
   described by a variety of organizations. From the Life cycle
   Information For E-Literature (LIFE) project in the United Kingdom, to
   the ongoing digital preservation work in the U.S.
   Library of Congress, the basic digital preservation process is
   straightforward [LIFE] [USLOC]. Documents are acquired and
   processed, metadata is recorded, physical media is refreshed, and
   content is regularly checked to see if it is still accessible by
   interested parties. The complexities arise when one considers the
   need to preserve both the bits of the digital objects themselves and
   the tools with which to express those bits in an environment that
   experiences rapid changes in technology.

   For most of the existence of the RFC Series, the digital preservation
   process has been fairly simple, focusing on bit stream preservation
   and relying on paper copies of digital files.

   The archival process for the RFC Series is as follows:

   1.  Acquisition: The RFC Editor database is updated to indicate an
       Internet-Draft (I-D) has been approved for publication. At this
       point, the document is taken through the editorial process on the
       way to publication [RFC-PUB].

   2.  Ingest: The RFC is added to the archive at the time of
       publication.

   3.  Metadata creation: The details regarding an RFC, including RFC
       number, author, title, abstract, etc., are created at time of
       publication. Additional metadata in the form of status and
       errata can be added or changed at any time, following the process
       of the originating document stream.

   4.  Bit stream preservation: This part of the process is handled as
       part of the IT system administration; all servers, disks, and
       backup technology are refreshed on a regular cycle.

   5.  Content preservation: All RFCs since January 2010 are printed out
       on standard office paper at time of publication, and the
       electronic files are preserved on disk and in backups with no
       particular focus on preserving the entire computing environment
       used to create the electronic documents.
       Most RFCs prior to January 2010 are also available on paper, but
       there are gaps in the record and issues of ownership around the
       paper copies before that date.

   When the format for RFCs transitions from plain-text, ASCII-only
   files to an XML format with multiple outputs, the archival process
   overall will become more complex. Additional metadata and some or
   possibly all of the computing environment may need to be added to the
   archive.

2. Updating Policy and Procedure

   RFCs are created and published as digital objects. Unlike
   paper-based publications, a digital collection requires a focus on
   retaining the details of the technology as well as retaining the
   object itself. Specifically, a digital archive needs to:

   o  consider the inherent instability of digital media;

   o  plan for a relatively short path to technological obsolescence;

   o  schedule regular media updates;

   o  apply predefined criteria for technology evaluation; and,

   o  ensure the continued authenticity and integrity of RFCs through
      any changes in technology.

   As the custodian and canonical source of RFCs and associated errata,
   the RFC Editor must consider how to ensure the availability and
   integrity of this document series far into the future and determine
   whether the focus must be on bit stream preservation, content
   preservation, or both.

   The RFC Editor has several advantages in acting as the digital
   archivist for the Series. Since the RFC Editor is the publisher as
   well as the archivist, the RFC Editor controls the format of the
   material and the process for adding those materials to an archive,
   and can add any additional metadata considered necessary. External
   materials, while a major consideration for more general archives, are
   no longer accepted by the RFC Editor.
   (See "Internet Archaeology: Documents from Early History" for the
   list of non-RFC digital objects held by the RFC Editor
   [RFC-HISTORY].)

   This document describes several different preservation models that
   may fit the needs of the Series, and raises several points for
   community consideration. Specifically, it covers information on:

   o  Acquisition of documents

   o  Ingestion of documents

   o  Metadata and document registration

   o  Normalization and standardization of canonical file structure and
      format

   o  Transformation/migration to current publication formats

   o  Content and computing environment preservation

   o  System parameters

   o  Financial impact

2.1. Acquisition of Documents

   The acquisition process for documents intended for the archive starts
   with the submission of an approved I-D for publication. During the
   editorial process, information such as the document metadata is
   finalized prior to publication. The initial I-D as submitted and the
   RFC produced from it do not formally enter the archive, however,
   until the time of publication, which is considered the point of
   ingestion from an archival perspective.

2.2. Ingestion of Documents

   Once an RFC is published, the canonical format is considered
   immutable. At this point, the RFC Production Center, one of the
   internal roles within the RFC Editor, assigns the document metadata
   an archivist needs to identify the unique object.

   In the case of RFCs, the metadata assigned to a document at the time
   of publication includes:

   o  the RFC number

   o  ISSN

   o  publication date

   o  Digital Object Identifier (DOI)

   Additional metadata, such as author name, is assigned earlier in the
   document creation process, but it is subject to change up to the
   point of publication. More information on metadata is available in
   Section 2.3, "Metadata and document registration."
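
   As a rough illustration of the publication-time metadata listed
   above, the following Python sketch models an ingest record as an
   object whose fields cannot be reassigned once created. The class
   name IngestRecord, the helper make_ingest_record, the field names,
   and the sample RFC number and date are all hypothetical and do not
   describe the RFC Editor's database. The sketch also assumes the ISSN
   printed in the RFC boilerplate (2070-1721) and the DOI form
   10.17487/RFCnnnn used for RFCs.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: the Series-wide ISSN and the DOI prefix are fixed values.
RFC_SERIES_ISSN = "2070-1721"
RFC_DOI_PREFIX = "10.17487"


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class IngestRecord:
    """Hypothetical publication-time metadata for a single RFC."""
    rfc_number: int
    issn: str
    publication_date: date
    doi: str


def make_ingest_record(rfc_number: int, published: date) -> IngestRecord:
    """Build the record from the two values that vary per document."""
    return IngestRecord(
        rfc_number=rfc_number,
        issn=RFC_SERIES_ISSN,
        publication_date=published,
        # RFC numbers below 1000 are zero-padded to four digits.
        doi=f"{RFC_DOI_PREFIX}/RFC{rfc_number:04d}",
    )


# Made-up sample values, for illustration only.
record = make_ingest_record(12345, date(2017, 2, 28))
print(record.doi)  # prints 10.17487/RFC12345
```

   Making the record immutable mirrors the statement above that the
   canonical format does not change after publication; metadata that can
   change later, such as status or errata, would live elsewhere.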

   In terms of deciding what to accept in the archive--a major question
   for most archives, and yet simple for the RFC Series--the RFC Editor
   accepts documents that are approved for publication by the
   stream-approving body of one of the document streams: the IETF, IAB,
   IRTF, or Independent Submission streams [RFC5741]. Each document
   stream has defined processes on when and how I-Ds are approved and
   submitted to the RFC Editor for publication. The RFC Editor does not
   select documents for publication and archiving; the RFC Editor edits
   and publishes documents as directed by the document streams.

   The RFC Editor holds no copyright on I-Ds or RFCs. As per the IETF
   Trust Legal Provisions, the copyright for RFCs is held by the authors
   and the IETF Trust [TLP]. At any point in time, the current entities
   providing RFC Editor services must be able to release the archive of
   RFCs to the IETF Trust.

   Note: The RFC Editor is currently only responsible for RFCs; any
   associated data sets or other research data are not considered within
   the RFC Editor's mandate at this time, and therefore the archival
   requirements of such data sets are not covered in this document.

2.3. Metadata and document registration

   Metadata is data about data. In the field of digital archiving, this
   is the data that clearly identifies every aspect of a document, from
   its identifier (i.e., the RFC number, the I-D draft string) to the
   size and file format of the document and more. Metadata is stored in
   a central registry that records what exactly is being preserved,
   where it is located, information on authenticity and provenance, and
   details on the hardware and/or software needed to view or create the
   documents.

   The RFC Editor maintains this registry in the form of a database that
   includes all metadata available for documents engaged in the final
   editing and publication process. This database feeds the search
   engine on the RFC Editor website and the Info pages available for
   every RFC (e.g., http://www.rfc-editor.org/info/rfc####).

   Current list of metadata presented in the RFC Info pages

   o  RFC number

   o  Canonical URI

   o  Title

   o  Status

   o  Updates

   o  Authors

   o  Stream

   o  Abstract

   o  Content-Type

   o  Character Set

   o  ISSN

   o  Publication date

   o  Digital Object Identifier (DOI)

   Metadata to be added in the future

   o  Publication format URIs

   Info pages also include links to errata, IPR searches, and plain-text
   and XML citation files.

   In terms of best practice, all documents used as normative references
   within an RFC would also be stored in the archive. While this is
   done automatically when the normative reference is another RFC (the
   usual case), retaining a copy of third-party documents is considered
   out of scope for the RFC Editor. As the digital archive industry
   stabilizes, services such as Perma.CC may be a reasonable compromise
   [PERMACC]. Those services provide a permanent URI and image capture
   of online documents, with a goal of buffering against URI and online
   availability changes.

2.4. Normalization and standardization of canonical file structure and
     format

   The normalization process is perhaps the most technically critical
   part of digital archiving. The purpose here is content
   preservation--making sure the data accepted for archiving are in the
   most stable and easily accessed formats possible for the long-term
   future, requiring the least amount of re-engineering and emulation of
   environments in order to view the document in the future.
   Normalization is about enabling long-term access to the information
   within a document.

   Over the history of the RFC Series, documents have been submitted for
   publication in a variety of formats, including paper in the earliest
   RFCs. Today, the majority of RFCs are available in both a canonical
   plain-text format and PDF format. For exceptions, see the RFC Online
   Project [RFC-ONLINE].

   Currently, all RFCs are printed out to paper and stored at time of
   publication. This has been a reasonable backup plan for several
   decades. With few of the features one might expect from a digital
   document format (including links, metadata within the document, or
   line drawings), plain-text files do not lose much, if any,
   information when printed out to paper. As the published formats
   change (see RFC 6949), however, printing to paper provides less
   value, as much of the metadata that is an intrinsic yet invisible
   part of the rendered document will be lost in such printing. With
   that in mind, the focus needs to shift to preserving the new file
   formats electronically.

   While each RFC today is printed to paper and all electronic versions
   stored on multiple hard drives, no particular effort is made to
   ensure copies of the software used to render or read the canonical
   plain-text RFC are also archived.
   The RFC Editor has several choices on how to adapt to a more complex
   set of data to archive and follow best practice as defined by the
   digital archive community:

   o  a simplified bit stream preservation model that focuses on "best
      effort" standard data retention practices, which rely on backups,
      upgrades, and regular equipment change to preserve the data, and
      assume that emulators may be built when needed if the formats
      used go out of common use (a significant part of the existing
      model);

   o  a content preservation model that focuses on one publication
      format as a version most likely to be viewable and provide all
      necessary metadata in the future (a viable option considering the
      fact that PDF/A-3--one of the intended publication formats--was
      designed for this type of archiving) [PDF];

   o  a complex bit stream and content preservation model that focuses
      on archiving the canonical XML and the entire computing
      environment required to create, view and render all outputs from
      that file (the "best practice" when looking at this from an
      archivist's perspective).

   Those options are listed in order of least to greatest complexity and
   expense. Each option is described in more detail below.

2.4.1. 'Best Effort' data retention

   When dealing with very simple data structures such as plain-text,
   ASCII-only files, the experience of the RFC Series suggests that for
   the last few decades, hardware and operating system changes have had
   minimal impact on the document files being stored. While a complete
   failure of an operating system migration in the past corrupted the
   data set, that situation represents a somewhat different problem
   than the tools themselves changing such that plain-text files are not
   easily read with existing technology.
   Given that the basic plain-text format and ASCII encoding remain in
   common use, the standard protections against file corruption and data
   loss, such as disk mirroring, off-site backups, and periodic
   restoration testing, will continue to provide access to the entirety
   of the RFC Series for the foreseeable future. As has been pointed
   out, both in this document and in broader community discussion, that
   is not sufficient when one moves into more complex formats such as
   XML, HTML, PDF, or the proprietary formats offered by today's large
   IT companies. The risk of technological change resulting in the file
   formats mentioned being deprecated or changed without backwards
   compatibility is fairly high when looking at a future of decades or
   centuries.

   It is recommended that this model of archiving the RFC Series cease
   to be the primary model after the plain-text, ASCII-only format is no
   longer the canonical format. Best effort data retention is a
   necessary but not sufficient level of effort for preserving a digital
   archive. For more guidance on how to define best effort data
   retention, the section on "Media and Formats, Summary
   Recommendations" in the latest version of the Digital Preservation
   Handbook provides useful and concrete information [DPC].

2.4.2. Single format for archival purposes

   If one subscribes to the idea that preserving the information
   described by a document, rather than the document itself, is the
   primary purpose of an archive, then focusing efforts on a single file
   format is a reasonable option. Some well-supported archival tooling
   projects follow this route, such as Archivematica. By selecting a
   feature-rich yet fundamentally stable file format for documents, an
   organization may avoid expensive whole-environment reconstruction in
   order to view the document.
   The PDF/A formats were designed as archival formats for electronic
   documents, and PDF/A-3 is one of the options intended for publication
   as the RFC Series moves from a plain-text canonical format to an XML
   canonical format with multiple publication formats. A PDF/A-3 file
   can be produced that embeds the XML from which the PDF/A-3 file was
   created, which in turn allows for both original and rendered document
   validation--if one has the correct tools available to see the source
   of the PDF/A-3 file [I-D.iab-rfc-use-of-pdf]. The XML is not
   otherwise visible when viewing the PDF/A-3 file through typical PDF
   reader software.

   When looking at the need to archive RFCs in a resource-limited
   environment, a content preservation-only model has merit, but it is
   not without risks. First, PDF/A-3 will not be the canonical format,
   but is intended to be one of the rendered outputs. It may contain
   rendering bugs that were not intended to be in the document. Second,
   while the various PDF/A formats were designed to be archival, they
   have not been put to the test of time to determine whether they will
   actually live up to their design goals.

   It is a valid option to consider, but the risks, priorities, and
   costs must be discussed by the community before a decision is made to
   follow this path. The best option may be to combine this with one of
   the other methods of archiving described in this document to help
   minimize both risk and cost.

2.4.3. Holistic archiving of the computing environment

   Preserving everything published through the RFC Editor in order to
   have a permanent record of information, standards, and best practice
   is arguably the whole point of being an archival series. One can
   argue that it is not only about the information described in an RFC;
   it is also about supporting Intellectual Property Rights (IPR) and
   retaining the history of the Internet.
   In following this model, however, one must consider the complexity of
   the archival environment as matching, and possibly exceeding, the
   complexity of the file formats being preserved.

   Consider a future where XML has been obsoleted for half a century,
   HTML5 was a format used three to four human generations ago, and
   PDF/A-3 is no longer supported by any existing company's reading
   software. In order for RFCs that were produced with XML as their
   canonical format to remain readable, an archive must not only hold
   the data, it must also hold the entire computing environment that
   allows the data to be rendered and viewed. Operating systems and the
   hardware on which they can run, each major version of each piece of
   software used or relied upon during the publication of an RFC, and
   browsers and readers for HTML, PDF, and any other publication format
   must all be preserved in some fashion. This is considered best
   practice when archiving digital documents. It is also the most
   expensive, and the cost only increases over time as more and more
   instances of the computing environment must be preserved over the
   lifetime of the Series.

   This is a valid option to consider, but the sheer scope of resources
   required suggests that this must be discussed by the community before
   a decision is made. Pursuing this may require an entirely different
   paradigm for the RFC Editor than what has been considered in the
   past; expanding the scope and resources for the RFC Editor, finding a
   third party to take over the responsibilities of archiving, or some
   other option may be necessary.

2.5. Transformation/migration to current publication formats

   Because normalization is a complex subject, it is important to
   consider how to mitigate the risk that the normalization process
   fails.

   The RFC Editor is responsible for making RFCs available to the
   Internet community.
   The canonical version of an RFC does not change once published; any
   formats officially rendered from the canonical version, however, may
   change. One way to mitigate the need to preserve the entire
   computing environment for an RFC, including web browsers and PDF
   readers, would be to take advantage of the non-canonical nature of
   the publication formats and re-render them from the canonical source
   at the point that browser or reader technology has changed
   sufficiently to make RFCs largely unavailable to 'modern' tools.

   For example, the RFC Editor may adopt an annual review of the tools
   needed to view the publication formats created by the RFC Editor, and
   determine whether or not the current common and popular reader
   technologies (e.g., web browsers, PDF viewers, e-readers) can view
   the existing publication formats. During that review, the RFC Editor
   would work with the community to determine if the current publication
   formats meet the needs of the community, and whether any should be
   retired or added to improve the availability of information to the
   community at that time.

2.6. System Parameters

   While the industry best practice on the backup and restoration of
   data is not sufficient as a long-term archival solution, it is still
   a necessary part of keeping the Series available now and into the
   future. In the past, nearly 800 RFCs had to be manually transcribed
   from paper back to electronic format due to a failed server migration
   and insufficient backups.

   The underlying servers hosting the tools, database, RFCs, and errata
   are the physical link in the archive environment. While such systems
   cannot and should not remain static and unchanging, there must be
   clear documentation regarding the environment, in particular the
   storage, backups, and recovery processes for all RFC-related
   material.
   The documentation must include information on the refresh cycle for
   the physical storage and backup media and describe a regular cycle of
   data restoration and/or migration testing.

2.7. Financial Planning

   Having a digital archive policy provides input into the budget
   process. The main costs associated with digital archives come from
   the complexity and quantity of the material being archived, as
   described in Section 2.4 on normalization. To quote the Digital
   Preservation Coalition's Digital Preservation Handbook [DPC]:

      The complexity of the material submitted and number of objects
      acquired generally has more impact on costs than the total storage
      size. The type and variety of formats accepted into the
      repository will also affect cost, because for example proprietary
      formats are likely to be more difficult and expensive to manage in
      the long term. It may be possible to reduce costs by limiting the
      formats the repository will accept, or transforming material into
      a standard common format. This can be done to reduce the number
      of file types and possibly reducing the storage size. However, it
      is also necessary to realise that due to storage redundancies
      required for back up each gigabyte of deposited data requires more
      than one gigabyte of disk space in repository storage. --
      http://www.dpconline.org/advice/preservationhandbook/institutional-strategies/costs-and-business-modelling

   Estimating potential costs and providing figures is outside of the
   scope of this document, but it should be noted that costs are a major
   factor when determining what level of archival practice an
   organization will follow.

3. Recommendations

   Given the need to balance cost and complexity with retention of
   information for historic, legal, and informational purposes,
   preservation efforts should focus on the XML canonical format files,
   the PDF/A-3 format files, the xml2rfc tool and its documentation, and
   at least two PDF reader applications capable of extracting the
   embedded XML. Care should be taken that the software being included
   in this archive has a provision for free copies for backup or archive
   purposes. All other formats and the overall computing environment
   should be stored as described in "best effort" data retention, which
   should in turn be described in the appropriate vendor contract for
   the RFC Publisher.

   Particular preservation efforts should be made by:

   o  choosing a format designed for archiving RFCs (PDF/A-3)

   o  embedding the canonical XML format within the PDF/A-3 file for
      RFCs

   o  retaining a copy of the plain-text or XML file submitted for
      approved I-Ds

   o  retaining all major versions of the tools and their associated
      documentation used to acquire and ingest an RFC

   o  retaining the final XML file as well as the PDF/A-3 file with the
      embedded XML

   o  retaining at least two software reader applications to ensure the
      PDF/A-3 and XML files can be viewed in the future

   o  partnering with other digital archives around the world to mirror
      copies of the target data

   In order to control costs and focus the archiving effort on the
   entire content of an RFC, including the metadata and other features
   embedded within each RFC published in more than just plain text,
   printing each RFC upon publication to paper is no longer reasonable.
   Proper data storage and mirrored copies of RFCs provide more
   efficient and effective copies in case of catastrophic failure of the
   existing archive of material.
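
   One common way to check that mirrored copies remain bit-for-bit
   identical to the source archive is a checksum manifest. The
   following Python sketch is illustrative only: the function names, the
   manifest layout (a mapping from relative file path to digest), and
   the choice of SHA-256 are assumptions, not a description of current
   RFC Editor practice.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict:
    """Map each file's relative POSIX-style path to its digest."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def verify_mirror(manifest: dict, mirror_dir: Path) -> list:
    """Return the manifest entries that are missing or differ in the mirror."""
    problems = []
    for name, expected in manifest.items():
        candidate = mirror_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(name)
    return problems
```

   A manifest built once from the primary archive can be shipped to each
   mirror and re-checked on a schedule; an empty result from
   verify_mirror means every listed file is present and unchanged. The
   check does not detect extra files in the mirror, and it says nothing
   about whether the content can still be rendered.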

   Particular focus should be given to finding partners that specialize
   in digital preservation to ingest RFCs.  Ideally, they will ingest
   all material associated with an RFC, including all metadata, digital
   signatures, and the approved Internet-Draft that was submitted to the
   RFC Editor.  The possibilities and options should be discussed with
   each archival partner; at minimum, they must ingest copies of RFCs as
   they are published, with the basic metadata associated with each
   document.

   Preservation efforts should be reviewed and validated through a
   biannual audit that will verify that the targeted content and all
   its associated metadata can be read with existing tools.  The full
   process from acquisition to ingest should be reviewed to ensure that
   best current practice is being followed from a digital archive
   community perspective.  Since the overall model for the digital
   archive maintained by the RFC Editor follows the OAIS Reference
   Model, the associated audit guidelines should be followed.  While
   the RFC Editor does not seek to be recognized as 'OAIS-compliant' at
   this time, use of the ISO standard, "Audit and Certification of
   Trustworthy Digital Repositories," would provide a solid, accepted
   method for structuring an audit for this digital archive [ISO16363].

4.  Summary

   The RFC Series is worth archiving.  It contains the history of the
   early Internet, as well as some of the key standards for Internet
   technology and best practice today.  Who knows what the community
   will create in the future?  There are many ways to preserve the
   Series, from relying on preservation of the bits, to focusing on a
   single file format, to preserving the entire computing environment.
   Each possibility, and each combination of them, involves risks and
   requires varying levels of resources.
   The goal of this document is to
   describe the possibilities and associated risks so that the community
   can come to an informed decision regarding what they are willing to
   see supported far into the future.

5.  IANA Considerations

   This document has no IANA actions.

6.  Security Considerations

   This document assumes that the origination of RFCs via the RFC Editor
   is secure and trusted.  With that assumption, the activities
   discussed in this document do not affect the security of the
   Internet.

7.  Informative References

   [I-D.iab-rfc-use-of-pdf]
              Hansen, T., Masinter, L., and M. Hardy, "PDF for an RFC
              Series Output Document Format", draft-iab-rfc-use-of-
              pdf-02 (work in progress), May 2016.

   [DPC]      Digital Preservation Coalition, "Digital Preservation
              Handbook", 2012.

   [ISO14721]
              International Organization for Standardization, "Space
              data and information transfer systems -- Open archival
              information system (OAIS) -- Reference model", ISO
              14721:2012, 2012.

   [ISO16363]
              International Organization for Standardization, "Space
              data and information transfer systems -- Audit and
              Certification of Trustworthy Digital Repositories", ISO
              16363:2011, 2011.

   [LIFE]     Hole, B., "LIFE^3: Predictive Costing of Digital
              Preservation", July 2010.

   [PDF]      International Organization for Standardization,
              "Electronic document file format for long-term
              preservation -- Part 3: Use of ISO 32000-1 with support
              for embedded files (PDF/A-3)", ISO 19005-3, 2012.

   [PERMACC]  "Perma.CC", n.d.

   [RFC-HISTORY]
              RFC Editor, "Internet Archaeology: Documents from Early
              History", n.d.

   [RFC-ONLINE]
              RFC Editor, "History of RFC Online Project", n.d.

   [RFC-PUB]  RFC Editor, "RFC Editor Publication Process", n.d.

   [RFCSERIES]
              RFC Editor, "Overview of RFC Document Series", n.d.

   [TLP]      IETF Trust, "IETF Trust Legal Provisions", n.d.

   [USLOC]    Library of Congress, "Life Cycle Models for Digital
              Stewardship", n.d.

   [RFC5741]  Daigle, L., Ed., Kolkman, O., Ed., and IAB, "RFC Streams,
              Headers, and Boilerplates", RFC 5741,
              DOI 10.17487/RFC5741, December 2009.

   [RFC6635]  Kolkman, O., Ed., Halpern, J., Ed., and IAB, "RFC Editor
              Model (Version 2)", RFC 6635, DOI 10.17487/RFC6635, June
              2012.

   [RFC6949]  Flanagan, H. and N. Brownlee, "RFC Series Format
              Requirements and Future Development", RFC 6949,
              DOI 10.17487/RFC6949, May 2013.

Author's Address

   Heather Flanagan
   RFC Editor

   Email: rse@rfc-editor.org
   URI:   http://orcid.org/0000-0002-2647-2220