Network Working Group                                        H. Flanagan
Internet-Draft                                                RFC Editor
Intended status: Informational                         November 14, 2016
Expires: May 18, 2017

         Digital Preservation Considerations for the RFC Series
                     draft-iab-rfc-preservation-02

Abstract

   The RFC Editor is both the publisher and the archivist for the RFC
   Series. This document applies specifically to the archivist role of
   the RFC Editor. It provides guidance on when and how to preserve
   RFCs, and the tools required to view or re-create RFCs as necessary.
   This document also highlights where gaps are in the current process,
   and where compromises are suggested to balance cost with ideal best
   practice.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 18, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Life Cycle of Digital Preservation
   2.  Updating Policy and Procedure
     2.1.  Acquisition of Documents
     2.2.  Ingest of Documents
     2.3.  Metadata and document registration
     2.4.  Normalization and standardization of canonical file
           structure and format
       2.4.1.  'Best Effort' data retention
       2.4.2.  Single format for archival purposes
       2.4.3.  Holistic archiving of the computing environment
     2.5.  Transformation/migration to current publication formats
     2.6.  System Parameters
     2.7.  Financial Planning
   3.  Recommendations
   4.  Summary
   5.  IANA Considerations
   6.  Security Considerations
   7.  Draft Change Log
     7.1.  draft-flanagan-rfc-preservation-02 to
           draft-iab-rfc-preservation-00
   8.  Informative References
   Author's Address

1.  Introduction

   The RFC Editor is both the publisher and the archivist for the RFC
   Series, a series of technical specifications and policy documents
   that includes foundational Internet standards [RFC6635] [RFCSERIES].
   As the publisher of these documents, the goal is to produce clear,
   consistent, and readable documents for the community using as many
   modern features, such as hyperlinks and content markup, within the
   document as necessary to convey the information the authors intended
   for their audience. As the archivist, however, the main goal is to
   preserve both the information described and the documents themselves
   for the indefinite future.
   To meet both of these goals, the RFC
   Editor must find the necessary balance between the publication needs
   of today and the archival needs of tomorrow, while acknowledging a
   finite set of resources to complete both aspects of the RFC Editor
   function.

   While many files are created during the publication process, this
   document focuses on the archival needs of RFCs and the
   Internet-Drafts (I-Ds) that are approved for publication; I-Ds before
   they are approved for publication by the appropriate stream-approving
   body are out of scope.

   To summarize, the key areas of tension between the roles of publisher
   and archivist are:

   o  the desire of the publisher to meet the needs expressed by the
      authors who want to use the latest technology within their
      documents, such as vector graphics, live links, and a rich set of
      metadata;

   o  the desire of the archivist to support only the simplest format
      for documents possible--currently held by the Series to be
      ASCII-only plain-text--so that the tools needed to view the
      documents are equally simple and resistant to changes in
      technology, resulting in a set of documents that will be easier
      to archive for at least the next several decades if not centuries.

   Through most of the history of the RFC Series, the file format for
   RFCs has been plain text with an ASCII-only character set. This
   choice offered the simplest format likely to remain available to the
   largest number of consumers, and the one most likely to be resistant
   to changes in technology over time. Increasingly, however, consumers
   and authors are requesting additional features that would allow for
   easy reading on a wider array of devices and retain all the metadata
   an author intended in their document.
   In 2013, RFC 6949, "RFC Series
   Format Requirements and Future Development," captured the high-level
   requirements for the Series; the fundamental issue was that the
   plain-text, ASCII-only documents no longer met the needs of the
   communities interested in using and producing RFCs [RFC6949].

   The assertion that plain-text, ASCII-only documents no longer meet
   the needs of the community in turn suggests that the simple archive
   process maintained by the RFC Editor is also no longer sufficient.
   More complex tools and file formats require a more complex process to
   make sure that RFCs can still be read and rendered far into the
   future. This document describes the considerations that must inform
   any changes in policy and procedure, and describes a model for the
   RFC Series to follow when additional formats beyond the ASCII-only,
   plain-text RFCs are published. The functional model that provides
   the framework for the archival process described in this document was
   derived from the ISO Open Archival Information System (OAIS)
   Reference Model, defined in "Space data and information transfer
   systems - Open archival information system (OAIS) - Reference model"
   [ISO14721].

1.1.  Terminology

   Acquisition: The point at which a document is accepted by the RFC
   Editor for future inclusion into the archive.

   Ingest: The point at which a digital object is assigned all necessary
   metadata to describe the object and its contents, and added to the
   archive.

   Bit stream preservation: The process of storing and maintaining
   digital objects over time, ensuring that there is no loss or
   corruption of the bits making up those objects.

   Content preservation: The retention of the ability to read, listen
   to, or watch a digital file in perpetuity. It is not about the bits
   being stored; it is about being able to access and present those bits
   to the user.

1.2.  Life Cycle of Digital Preservation

   The basic process for preserving digital information has been
   described by a variety of organizations. From the Life cycle
   Information For E-Literature (LIFE) project in the United Kingdom to
   the ongoing digital preservation work in the U.S. Library of
   Congress, the basic digital preservation process is straightforward
   [LIFE] [USLOC]. Documents are acquired and processed, metadata is
   recorded, physical media is refreshed, and content is regularly
   checked to see if it is still accessible by interested parties. The
   complexities arise when one considers the need to preserve both the
   bits of the digital objects themselves and the tools with which to
   express those bits in an environment that experiences rapid changes
   in technology.

   For most of the existence of the RFC Series, the digital preservation
   process has been fairly simple, focusing on bit stream preservation
   and relying on paper copies of digital files.

   The archival process for the RFC Series is as follows:

   1.  Acquisition: The RFC Editor database is updated to indicate an
       Internet-Draft (I-D) has been approved for publication. At this
       point, the document is taken through the editorial process on the
       way to publication [RFC-PUB].

   2.  Ingest: The RFC is added to the archive at the time of
       publication.

   3.  Metadata creation: The details regarding an RFC, including RFC
       number, author, title, abstract, etc., are created at time of
       publication. Additional metadata in the form of status and
       errata can be added or changed at any time, following the process
       of the originating document stream.

   4.  Bit stream preservation: This part of the process is handled as
       part of the IT system administration; all servers, disks, and
       backup technology are refreshed on a regular cycle.

   5.  Content preservation: All RFCs since January 2010 are printed out
       on standard office paper at time of publication, and the
       electronic files are preserved on disk and in backups with no
       particular focus on preserving the entire computing environment
       used to create the electronic documents. Most RFCs prior to
       January 2010 are also available on paper, but there are gaps in
       the record and issues of ownership around the paper copies before
       that date.

   When the format for RFCs transitions from plain-text, ASCII-encoded
   files to an XML format with multiple outputs, the archival process
   overall will become more complex. Additional metadata and some or
   possibly all of the computing environment may need to be added to the
   archive.

2.  Updating Policy and Procedure

   RFCs are created and published as digital objects. Unlike
   paper-based publications, a digital collection requires a focus on
   retaining the details of the technology as well as retaining the
   object itself. Specifically, a digital archive needs to:

   o  consider the inherent instability of digital media;

   o  plan for a relatively short path to technological obsolescence;

   o  schedule regular media updates;

   o  apply predefined criteria for technology evaluation; and,

   o  ensure the continued authenticity and integrity of RFCs through
      any changes in technology.

   As the custodian and canonical source of RFCs and associated errata,
   the RFC Editor must consider how to ensure the availability and
   integrity of this document series far into the future and determine
   whether the focus must be on bit stream preservation, content
   preservation, or both.

   The RFC Editor has several advantages in acting as the digital
   archivist for the Series.
   Since the RFC Editor is the publisher as
   well as the archivist, the RFC Editor controls the format of the
   material and the process for adding those materials to an archive,
   and can add any additional metadata considered necessary. External
   materials, while a major consideration for more general archives, are
   no longer accepted by the RFC Editor. (See "Internet Archaeology:
   Documents from Early History" for the list of non-RFC digital objects
   held by the RFC Editor [RFC-HISTORY].)

   This document describes several different preservation models that
   may fit the needs of the Series, and raises several points for
   community consideration. Specifically, it covers information on:

   o  Acquisition of documents

   o  Ingest of documents

   o  Metadata and document registration

   o  Normalization and standardization of canonical file structure and
      format

   o  Transformation/migration to current publication formats

   o  Content and computing environment preservation

   o  System parameters

   o  Financial impact

2.1.  Acquisition of Documents

   The acquisition process for documents intended for the archive starts
   with the submission of an approved I-D for publication. During the
   editorial process, information such as the document metadata is
   finalized prior to publication. The initial I-D as submitted and the
   RFC produced from it do not formally enter the archive, however,
   until the time of publication, which is considered the point of
   ingest from an archival perspective.

2.2.  Ingest of Documents

   Once an RFC is published, the canonical format is considered
   immutable. At this point, the RFC Production Center, one of the
   internal roles within the RFC Editor, assigns the document metadata
   an archivist needs to identify the unique object.
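   As an illustration only, the ingest step described above could be
   modeled as an immutable record holding the identifiers this section
   names as assigned at publication (RFC number, ISSN, publication
   date, and DOI). The class name, function name, RFC number, and date
   below are hypothetical, and the ISSN and DOI prefix appear purely as
   example values; this is a minimal sketch, not a description of the
   RFC Editor's database.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record of the identifiers assigned at publication."""

    rfc_number: int
    issn: str
    publication_date: date
    doi: str


def ingest(rfc_number: int, published: date) -> IngestRecord:
    """Build the record once; frozen=True rejects later modification."""
    return IngestRecord(
        rfc_number=rfc_number,
        issn="2070-1721",  # example value for the series-level ISSN
        publication_date=published,
        doi=f"10.17487/RFC{rfc_number:04d}",  # example DOI pattern
    )


if __name__ == "__main__":
    # The RFC number and date here are placeholders, not real metadata.
    record = ingest(9999, date(2020, 1, 1))
    print(record)
    try:
        record.rfc_number = 1  # any change after ingest is refused
    except FrozenInstanceError:
        print("record is immutable after ingest")
```

   Freezing the record mirrors the statement that the canonical format
   is immutable after publication; metadata that may still change, such
   as status or errata, would have to be kept in a separate structure.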
   In the case of RFCs, the metadata assigned to a document at the
   time of publication includes:

   o  the RFC number

   o  ISSN

   o  publication date

   o  Digital Object Identifier (DOI)

   Additional metadata, such as author name, is assigned earlier in the
   document creation process, but it is subject to change up to the
   point of publication. More information on metadata is available in
   section "Metadata and document registration."

   The publication of an RFC--the point at which responsibility for the
   document moves to the RFC Publisher, another internal role within the
   RFC Editor--starts the formal archival process for the documents.

   In terms of deciding what to accept in the archive--a major question
   for most archives, and yet simple for the RFC Series--the RFC Editor
   accepts documents that are approved for publication by the
   stream-approving body of one of the document streams: the IETF, IAB,
   IRTF, or Independent Submissions streams [RFC5741]. Each document
   stream has defined processes on when and how I-Ds are approved and
   submitted to the RFC Editor for publication. The RFC Editor does not
   select documents for publication and archiving; the RFC Editor edits
   and publishes documents as directed by the document streams.

   The RFC Editor holds no copyright on I-Ds or RFCs. As per the IETF
   Trust Legal Provisions, the copyright for RFCs is held by the authors
   and the IETF Trust [TLP]. At any point in time, the current entities
   providing RFC Editor services must be able to release the archive of
   RFCs to the IETF Trust.

   Note: The RFC Editor is currently only responsible for RFCs; any
   associated data sets or other research data are not considered within
   the RFC Editor's mandate at this time, and therefore the archival
   requirements of such data sets are not covered in this document.

2.3.  Metadata and document registration

   Metadata is data about data. In the field of digital archiving, this
   is the data that clearly identifies every aspect of a document, from
   its identifier (i.e., the RFC number, the I-D draft string) to the
   size and file format of the document and more. Metadata is stored in
   a central registry that records exactly what is being preserved,
   where it is located, its authenticity and provenance, and details on
   the hardware and/or software needed to view or create the documents.

   The RFC Editor maintains this registry in the form of a database that
   includes all metadata available for documents engaged in the final
   editing and publication process. This database feeds the search
   engine on the RFC Editor website and the Info Pages available for
   every RFC (e.g., http://www.rfc-editor.org/info/rfc####).

   Current list of metadata presented in the RFC Info pages

   o  RFC number

   o  Canonical URI

   o  Title

   o  Status

   o  Updates

   o  Authors

   o  Stream

   o  Abstract

   o  Content-Type

   o  Character Set

   o  ISSN

   o  Publication date

   o  Digital Object Identifier (DOI)

   Metadata to be added in the future

   o  Publication format URIs

   Info pages also include links to errata, IPR searches, and plain-text
   and XML citation files.

   In terms of best practice, all documents used as normative references
   within an RFC would also be stored in the archive. While this is
   done automatically when the normative reference is another RFC (the
   usual case), retaining a copy of third-party documents is considered
   out of scope for the RFC Editor. As the digital archive industry
   stabilizes, services such as Perma.CC may be a reasonable compromise
   [PERMACC]. Those services provide a permanent URI and image capture
   of online documents, with a goal of buffering against URI and online
   availability changes.

2.4.  Normalization and standardization of canonical file structure and
      format

   The normalization process is perhaps the most technically critical
   part of digital archiving. The purpose here is content
   preservation--making sure the data accepted for archiving are in the
   most stable and easily accessed formats possible for the long-term
   future, requiring the least amount of re-engineering and emulation of
   environments in order to view the document in the future.
   Normalization is about enabling long-term access to the information
   within a document.

   Over the history of the RFC Series, documents have been submitted for
   publication in a variety of formats, including paper in the earliest
   RFCs. Today, the majority of RFCs are available in both a canonical
   plain-text format and PDF format. For exceptions to this list, see
   the RFC Online Project [RFC-ONLINE].

   Currently, all RFCs are printed out to paper and stored at time of
   publication. This has been a reasonable backup plan for several
   decades. With few of the features one might expect from a digital
   document format (including links, metadata within the document, or
   line drawings), plain-text files do not lose much, if any,
   information when printed out to paper. As the published formats
   change (see RFC 6949), however, printing to paper provides less
   value, as much of the metadata that is an intrinsic yet invisible
   part of the rendered document will be lost in such printing. With
   that in mind, the focus needs to shift to preserving the new file
   formats electronically.

   While each RFC today is printed to paper and all electronic versions
   stored on multiple hard drives, no particular effort is made to
   ensure copies of the software used to render or read the canonical
   plain-text RFC are also archived.
   The RFC Editor has several choices
   on how to adapt to a more complex set of data to archive and follow
   best practice as defined by the digital archive community:

   o  a simplified bit stream preservation model that focuses on "best
      effort" standard data retention practices, which rely on backups,
      upgrades, and regular equipment change to preserve the data, and
      that assumes emulators can be built when needed if the formats
      used go out of common use (a significant part of the existing
      model);

   o  a content preservation model that focuses on one publication
      format as a version most likely to be viewable and provide all
      necessary metadata in the future (a viable option considering the
      fact that PDF/A-3--one of the intended publication formats--was
      designed for this type of archiving) [PDF];

   o  a complex bit stream and content preservation model that focuses
      on archiving the canonical XML and the entire computing
      environment required to create, view and render all outputs from
      that file (the "best practice" when looking at this from an
      archivist's perspective).

   Those options are listed in order of least to greatest complexity and
   expense. More detail on each option is described below.

2.4.1.  'Best Effort' data retention

   When dealing with very simple data structures such as plain-text,
   ASCII-only files, the experience of the RFC Series suggests that for
   the last few decades, hardware and operating system changes have had
   minimal impact on the document files being stored. While a complete
   failure of an operating system migration in the past corrupted the
   data set, that situation represents a somewhat different problem
   than the tools themselves changing such that plain-text files are not
   easily read with existing technology.
   Given that the basic
   plain-text format and ASCII encoding remain in common use, the
   standard protections against file corruption and data loss, such as
   disk mirroring, off-site backups, and periodic restoration testing,
   will continue to provide access to the entirety of the RFC Series for
   the foreseeable future. As has been pointed out, both in this
   document and in broader community discussion, that is not sufficient
   when one moves into more complex formats such as XML, HTML, PDF, or
   the proprietary formats offered by today's large IT companies. The
   risk of technological change resulting in the file formats mentioned
   being deprecated or changed without backwards compatibility is fairly
   high when looking at a future of decades or centuries.

   It is recommended that this model of archiving the RFC Series cease
   to be the primary model after the plain-text, ASCII-only format is no
   longer the canonical format. Best effort data retention is a
   necessary but not sufficient level of effort for preserving a digital
   archive. For more guidance on how to define best effort data
   retention, the section on "Media and Formats, Summary
   Recommendations" in the latest version of the Digital Preservation
   Handbook provides useful and concrete information [DPC].

2.4.2.  Single format for archival purposes

   If one subscribes to the idea that preserving the information
   described by a document, rather than the document itself, is the
   primary purpose of an archive, then focusing efforts on a single file
   format is a reasonable option. Some well-supported archival tooling
   projects follow this route, such as Archivematica
   (https://www.archivematica.org/wiki/Main_Page). By selecting a
   feature-rich yet fundamentally stable file format for documents, an
   organization may avoid expensive whole-environment reconstruction in
   order to view the document.
   The PDF/A formats were designed to be an
   archival format for electronic documents, and PDF/A-3 is one of the
   options intended for publication as the RFC Series moves from a
   plain-text canonical format to an XML canonical format with multiple
   publication formats. A PDF/A-3 file can be produced that embeds the
   XML from which the PDF/A-3 file was created, which in turn allows for
   both original and rendered document validation--if one has the
   correct tools available to see the source of the PDF/A-3 file
   [I-D.iab-rfc-use-of-pdf]. The XML is not otherwise visible when
   viewing the PDF/A-3 file through typical PDF reader software.

   When looking at the need to archive RFCs in a resource-limited
   environment, a content preservation-only model has merit, but it is
   not without risks. First, PDF/A-3 will not be the canonical format,
   but is intended to be one of the rendered outputs. It may contain
   rendering bugs that were not intended to be in the document. Second,
   while the various PDF/A formats were designed to be archival, they
   have not been put to the test of time to determine whether they will
   actually live up to their design goals.

   It is a valid option to consider, but the risks, priorities, and
   costs must be discussed by the community before a decision is made to
   follow this path. The best option may be to combine this with one of
   the other methods of archiving described in this document to help
   minimize both risk and cost.

2.4.3.  Holistic archiving of the computing environment

   Preserving everything published through the RFC Editor in order to
   have a permanent record of information, standards, and best practice
   is arguably the whole point of being an archival series. One can
   argue that it is not only about the information described in an RFC;
   it is also about supporting Intellectual Property Rights (IPR) and
   retaining the history of the Internet.
   In following this model,
   however, one must consider the complexity of the archival environment
   as matching, and possibly exceeding, the complexity of the file
   formats being preserved.

   Consider a future where XML has been obsoleted for half a century,
   HTML5 was a format used three to four human generations ago, and
   PDF/A-3 is no longer supported by any existing company's reading
   software. In order for RFCs that were produced with XML as their
   canonical format to remain readable, an archive must not only hold
   the data, it must also hold the entire computing environment that
   allows the data to be rendered and viewed. Operating systems and
   hardware on which those OSs can run, each major version of each piece
   of software used or relied upon during the publication of an RFC,
   and browsers and readers for HTML, PDF, and any other publication
   format must be preserved in some fashion. This is considered best
   practice when archiving digital documents. It is also the most
   expensive, and the cost only increases over time as more and more
   instances of the computing environment must be preserved over the
   lifetime of the Series.

   This is a valid option to consider, but the sheer scope of resources
   required suggests that this must be discussed by the community before
   a decision is made. Pursuing this may require an entirely different
   paradigm for the RFC Editor than what has been considered in the
   past; expanding the scope and resources for the RFC Editor, finding a
   third party to take over the responsibilities of archiving, or some
   other option may be necessary.

2.5.  Transformation/migration to current publication formats

   Because normalization is a complex subject, it is important to
   consider what to do to mitigate the risk of failure of the
   normalization process.

   The RFC Editor is responsible for making RFCs available to the
   Internet community.
   The canonical version of an RFC does not change
   once published; any formats officially rendered from the canonical
   version, however, may change. One way to mitigate the need to
   preserve the entire computing environment for an RFC, including web
   browsers and PDF readers, would be to take advantage of the
   non-canonical nature of the publication formats and re-render them
   from the canonical source at the point that browser or reader
   technology has changed sufficiently to make RFCs largely unavailable
   to 'modern' tools.

   For example, the RFC Editor may develop a practice of starting an
   annual review of the tools needed to view the publication formats
   created by the RFC Editor, and determine whether or not the current
   common and popular reader technologies (i.e., web browsers, PDF
   viewers, e-readers) can view the existing publication formats.

   During that review, the RFC Editor would work with the community to
   determine if the current publication formats meet the needs of the
   community, and whether any should be retired or added to improve the
   availability of information to the community at that time.

2.6.  System Parameters

   While the industry best practice on the backup and restoration of
   data is not sufficient as a long-term archival solution, it is still
   a necessary part of keeping the Series available now and into the
   future. In the past, nearly 800 RFCs had to be manually transcribed
   from paper back to electronic format due to a failed server migration
   and insufficient backups.

   The underlying servers hosting the tools, database, RFCs, and errata
   are the physical link in the archive environment. While such systems
   cannot and should not remain static and unchanging, there must be
   clear documentation regarding the environment, in particular the
   storage, backups, and recovery processes for all RFC-related
   material.
   The documentation must include information on the refresh cycle for
   the physical storage and backup media and describe a regular cycle
   of data restoration and/or migration testing.

2.7. Financial Planning

   Having a digital archive policy provides input into the budget
   process. The main costs associated with digital archives come from
   the complexity and quantity of the material being archived, as
   described in the section on Normalization. To quote the Digital
   Preservation Coalition's "Digital Preservation Handbook" [DPC]:

      The complexity of the material submitted and number of objects
      acquired generally has more impact on costs than the total
      storage size. The type and variety of formats accepted into the
      repository will also affect cost, because for example proprietary
      formats are likely to be more difficult and expensive to manage
      in the long term. It may be possible to reduce costs by limiting
      the formats the repository will accept, or transforming material
      into a standard common format. This can be done to reduce the
      number of file types and possibly reducing the storage size.
      However, it is also necessary to realise that due to storage
      redundancies required for back up each gigabyte of deposited data
      requires more than one gigabyte of disk space in repository
      storage. --
      http://www.dpconline.org/advice/preservationhandbook/institutional-strategies/costs-and-business-modelling

   Estimating potential costs and providing figures is outside the
   scope of this document, but it should be noted that costs are a
   major factor when determining what level of archival practice an
   organization will follow.

3. Recommendations

   Given the need to balance cost and complexity with retention of
   information for historic, legal, and informational purposes,
   preservation efforts should focus on the XML canonical format files,
   the PDF/A-3 format files, the xml2rfc tool and its documentation,
   and at least two PDF reader applications capable of extracting the
   embedded XML. Care should be taken that the software included in
   this archive has a provision for free copies for backup or archive
   purposes. All other formats and the overall computing environment
   should be stored under "best effort" data retention, which should in
   turn be described in the appropriate vendor contract for the RFC
   Publisher.

   Particular preservation efforts should be made by:

   o  choosing a format designed for archiving RFCs (PDF/A-3)

   o  embedding the canonical XML format within the PDF/A-3 file for
      RFCs

   o  retaining a copy of the plain-text or XML file submitted for
      approved I-Ds

   o  retaining all major versions of the tools and their associated
      documentation used to acquire and ingest an RFC

   o  retaining the final XML file as well as the PDF/A-3 file with the
      embedded XML

   o  retaining at least two software reader applications to ensure the
      PDF/A-3 and XML files can be viewed in the future

   o  partnering with other digital archives around the world to mirror
      copies of the target data

   In order to control costs and focus the archiving effort on the
   entire content of an RFC, including the metadata and other features
   embedded within each RFC published in more than just plain text,
   printing each RFC upon publication to paper is no longer reasonable.
   Proper data storage and mirrored copies of RFCs provide more
   efficient and effective protection in case of catastrophic failure
   of the existing archive of material.
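
   As an illustration only, the mirroring recommendation above could be
   supported by a routine comparison between the primary archive and a
   partner's copy. The sketch below is not part of the RFC Editor's
   tooling; the function names, the directory layout, and the choice of
   SHA-256 as the checksum are assumptions made for this example. It
   walks two directory trees and reports files that are missing from
   the mirror, present only in the mirror, or different in content.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(primary: Path, mirror: Path) -> Dict[str, List[str]]:
    """Compare a hypothetical primary archive tree with a mirrored tree."""
    a = build_manifest(primary)
    b = build_manifest(mirror)
    return {
        # Files in the primary archive that the mirror lacks.
        "missing": sorted(set(a) - set(b)),
        # Files in the mirror that the primary archive lacks.
        "extra": sorted(set(b) - set(a)),
        # Files present in both copies whose contents differ.
        "mismatched": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }
```

   Calling compare_copies(Path("archive"), Path("mirror")) on two
   hypothetical directories returns a dictionary whose three lists are
   all empty when the copies match.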

   Particular focus should be given to finding partners that specialize
   in digital preservation to ingest RFCs. Ideally, they will ingest
   all material associated with an RFC, including all metadata, and the
   approved Internet-Draft that was submitted to the RFC Editor. The
   possibilities and options should be discussed with each archival
   partner; at minimum, they must ingest copies of RFCs as they are
   published, with the basic metadata associated with each document.

   Preservation efforts should be reviewed and validated through a
   bi-annual audit that will verify that the targeted content and all
   its associated metadata can be read with existing tools. The full
   process from acquisition to ingest should be reviewed to ensure that
   best current practice is being followed from a digital archive
   community perspective. Since the overall model for the RFC
   Editor-maintained digital archive follows the OAIS Reference Model,
   the associated audit guidelines should be followed. While the RFC
   Editor does not seek to be recognized as 'OAIS-compliant' at this
   time, use of the ISO standard, "Audit and Certification of
   Trustworthy Digital Repositories," would provide a solid, accepted
   method for structuring an audit for this digital archive [ISO16363].

4. Summary

   The RFC Series is worth archiving. It contains the history of the
   early Internet, as well as some of the key standards for Internet
   technology and best practice today. Who knows what the community
   will create in the future? There are many ways to preserve the
   Series, from relying on preservation of the bits, to focusing on a
   single file format, to preserving the entire computing environment.
   Each possibility, and each combination of them, involves risks and
   requires varying levels of resources.
   The goal of this document is to describe the possibilities and
   associated risks so that the community can come to an informed
   decision regarding what they are willing to see supported far into
   the future.

5. IANA Considerations

   None

6. Security Considerations

   TBD

7. Draft Change Log

   To be removed before publication

7.1. draft-flanagan-rfc-preservation-02 to
     draft-iab-rfc-preservation-00

   Life cycle: updated how paper is currently handled

   Recommendations: clarified the PDF reader requirements, license
   requirements.

8. Informative References

   [I-D.iab-rfc-use-of-pdf]
              Hansen, T., Masinter, L., and M. Hardy, "PDF for an RFC
              Series Output Document Format",
              draft-iab-rfc-use-of-pdf-02 (work in progress), May 2016.

   [DPC]      Digital Preservation Coalition, "Digital Preservation
              Handbook", 2012, .

   [ISO14721]
              International Organization for Standardization, "Space
              data and information transfer systems -- Open archival
              information system (OAIS) -- Reference model",
              ISO 14721:2012, 2012.

   [ISO16363]
              International Organization for Standardization, "Space
              data and information transfer systems -- Audit and
              Certification of Trustworthy Digital Repositories",
              ISO 16363:2011, 2011.

   [LIFE]     Hole, B., "LIFE^3: Predictive Costing of Digital
              Preservation", July 2010, .

   [PDF]      International Organization for Standardization,
              "Electronic document file format for long-term
              preservation -- Part 3: Use of ISO 32000-1 with support
              for embedded files (PDF/A-3)", ISO 19005-3, 2012.

   [PERMACC]  "Perma.CC", n.d., .

   [RFC-HISTORY]
              RFC Editor, "Internet Archaeology: Documents from Early
              History", n.d., .

   [RFC-ONLINE]
              RFC Editor, "History of RFC Online Project", n.d., .

   [RFC-PUB]  RFC Editor, "RFC Editor Publication Process", n.d., .

   [RFCSERIES]
              RFC Editor, "Overview of RFC Document Series", n.d., .

   [TLP]      IETF Trust, "IETF Trust Legal Provisions", n.d., .

   [USLOC]    Library of Congress, "Life Cycle Models for Digital
              Stewardship", n.d., .

   [RFC5741]  Daigle, L., Ed., Kolkman, O., Ed., and IAB, "RFC Streams,
              Headers, and Boilerplates", RFC 5741,
              DOI 10.17487/RFC5741, December 2009, .

   [RFC6635]  Kolkman, O., Ed., Halpern, J., Ed., and IAB, "RFC Editor
              Model (Version 2)", RFC 6635, DOI 10.17487/RFC6635,
              June 2012, .

   [RFC6949]  Flanagan, H. and N. Brownlee, "RFC Series Format
              Requirements and Future Development", RFC 6949,
              DOI 10.17487/RFC6949, May 2013, .

Author's Address

   Heather Flanagan
   RFC Editor

   Email: rse@rfc-editor.org
   URI:   http://orcid.org/0000-0002-2647-2220