Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                      Royal Danish Library
Intended status: Informational                             July 16, 2018
Expires: January 17, 2019

            A Persistent Web IDentifier (PWID) URN Namespace
                    draft-pwid-urn-specification-03

Abstract

   This document specifies a Uniform Resource Name (URN) for Persistent
   Web IDentifiers to web material in web archives using the 'pwid'
   namespace identifier. The purpose of the standard is to support a
   general, exact referencing method, which includes support for
   references to archives with restricted access, for exact references
   to existing web material, and for exact specification of elements in
   a web corpus (possibly spanning several web archives). The PWID URN
   therefore offers a scheme for making references that are not
   currently supported.

   The PWID is designed for researchers. It is therefore designed to
   provide general, global, sustainable, human-readable, technology
   agnostic, persistent and precise references to web materials in web
   archives, in a way that can make them potentially resolvable.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 17, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Namespace Registration Template
   3.  Acknowledgements
   4.  References
     4.1.  Normative References
     4.2.  Informative References
   Author's Address

1.  Introduction

   The purpose of the PWID URN is to represent general, global,
   sustainable, human-readable, technology agnostic, persistent and
   precise web archive resource references in a way that:

   o  can be used for technical solutions, e.g. to make them resolvable

   o  can cover references to all sorts of materials in web archives

   o  can cover references to materials from all sorts of web archives

   The motivation for defining a PWID namespace is the growing challenge
   of references to archived web resources, which the PWID as a URN can
   assist in overcoming. The standard is needed to address web
   materials with a precision and persistency on par with traditional
   references for analogue material.
   Furthermore, it is needed in order to address web archive resources
   that are not freely available online. The PWID URN covers both
   referencing of web resources from research papers and definition of
   web collections/corpora. In detail, the challenges are:

   o  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]). However, an
      increasing number of references point to resources that only exist
      on the web, e.g. blogs that turned out to have a historical
      impact. In order to obtain persistency for a reference, the
      target needs to be stable. As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web. The PWID URN is therefore focused
      on referencing archived web material in a technology agnostic way
      (research documented in [IPRES] and [ResawRef]).

   o  There are many new initiatives for web archive referencing; most
      of them are centralised solutions which offer harvesting and
      referencing, but these cannot be used for existing materials in
      web archives. Other initiatives only cover open web archives, and
      thus do not cover material in archives with restricted access;
      there is also a risk of imprecision if a resource in an
      alternative archive is the result of resolving such a resource.
      The PWID URN is needed in order to fill these gaps where other
      techniques are not sufficient.

   o  There are many different requirements for construction of
      collection definitions for web material besides precision and
      persistency. Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection.
      The PWID URN is needed in such definitions in order to fulfil
      these requirements and to enable a collection to cover web
      materials from more than one archive (research documented in
      [ResawColl]).

   The PWID is especially useful for web material where precision is in
   focus and/or there are references to materials from web archives
   requiring special grants in order to gain access. The precision
   concerns both precise reference, where there can be no doubt that you
   have the correct web material, and precision about what the reference
   actually refers to (e.g. whether it is the page or the whole
   website).

   Furthermore, the PWID is very useful in the specification of the
   contents of a web collection (also known as a web corpus).
   Definitions of web collections are often needed for extraction of
   data used in production of research results, e.g. for evaluations in
   the future. Current practices are not persistent, as they often use
   some CDX version, which varies between implementations.

   For the sake of usability and sustainability, the definition of the
   PWID URN is focused on only having the minimum required information
   to make a precise identification of a resource in an arbitrary web
   archive. Recent research has found that this is obtained by the
   following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

   The PWID URN represents this information in an unambiguous way, thus
   enabling technical solutions to be defined using this URN.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Namespace Registration Template

   Namespace Identifier:

      PWID

   Version:

      3

   Date:

      2018-07-13

   Registrant:

      Eld Maj-Britt Olmuetz Zierau
      Royal Danish Library
      Soeren Kierkegaards Plads 1
      1219 Copenhagen
      Denmark
      ph: +45 9132 4690
      email: elzi@kb.dk

   Purpose:

      The purpose of the PWID URN is to represent general, global,
      sustainable, human-readable, technology agnostic, persistent and
      precise web archive resource references in a way that:

      *  can be used for technical solutions, e.g. to make them
         resolvable

      *  can cover references to all sorts of materials in web archives

      *  can cover references to materials from all sorts of web
         archives

      The motivation for defining a PWID namespace is the growing
      challenge of references to archived web resources, which the PWID
      as a URN can assist in overcoming. The standard is needed to
      address web materials with a precision and persistency on par with
      traditional references for analogue material. Furthermore, it is
      needed in order to address web archive resources that are not
      freely available online. This concerns both referencing of web
      resources from research papers and definition of web
      collections/corpora. In detail, the challenges are:

      *  Citation guidelines generally do not cover general and
         persistent referencing techniques for web resources that are
         not registered by Persistent Identifier systems (like DOI
         [DOI]). However, an increasing number of references point to
         resources that only exist on the web, e.g. blogs that turned
         out to have a historical impact. In order to obtain
         persistency for a reference, the target needs to be stable. As
         the live web is 'alive' and in constant change, persistency can
         only be obtained by referring to archived snapshots of the web.
         The PWID URN is therefore focused on referencing archived web
         material in a technology agnostic way (research documented in
         [IPRES] and [ResawRef]).

      *  There are many new initiatives for web archive referencing;
         most of them are centralised solutions which offer harvesting
         and referencing, but these cannot be used for existing
         materials in web archives. Other initiatives only cover open
         web archives, and thus do not cover material in archives with
         restricted access; there is also a risk of imprecision if a
         resource in an alternative archive is the result of resolving
         such a resource. The PWID URN is needed in order to fill these
         gaps where other techniques are not sufficient.

      *  There are many different requirements for construction of
         collection definitions for web material besides precision and
         persistency. Recent research has found that various legal and
         sustainability issues lead to a need for a collection to be
         defined by references to the web parts in the collection. The
         PWID URN is needed in such definitions in order to fulfil these
         requirements and to enable a collection to cover web materials
         from more than one archive (research documented in
         [ResawColl]).

      The PWID is especially useful for web material where precision is
      in focus and/or there are references to materials from web
      archives requiring special grants in order to gain access. The
      precision concerns both precise reference, where there can be no
      doubt that you have the correct web material, and precision about
      what the reference actually refers to (e.g. whether it is the page
      or the whole website).

      Furthermore, the PWID is very useful in the specification of the
      contents of a web collection (also known as a web corpus).
      Definitions of web collections are often needed for extraction of
      data used in production of research results, e.g. for evaluations
      in the future.
      Current practices are not persistent, as they often use some CDX
      version, which varies between implementations.

      Strict unambiguous syntax is needed for the PWID reference in
      order to ensure that it can be used for computational purposes.
      This is relevant for web collection definitions, which will need a
      strict syntax in order to be a basis for automatic extraction.
      Furthermore, readers of research papers today expect to be able to
      access a referenced resource by clicking an actionable URI; a
      similar facility will therefore be expected for references to
      available archived web material, which strict syntax can make
      possible. Examples of technical solutions that are enabled by
      strict syntax are:

      *  Resolving of references and automatic extraction of a web
         collection defined by PWID URNs [ResawRef] [ResawColl]

      *  Resolving of a PWID reference by resolving services. As a
         start, there is work on a prototype that can work for the
         Danish web archive data and open web archives with standard
         patterns for the current technologies. Different
         implementations for resolving may emerge, which may rely on
         different protocols and applications.

      The purpose of the PWID is also to express a web archive reference
      as simply as possible while at the same time meeting requirements
      for sustainability, usability and scope. Therefore, the PWID URN
      is focused on only having the minimum required information to make
      a precise identification of a resource in an arbitrary web
      archive. Recent research has found that this is obtained by the
      following information [ResawRef]:

      *  Identification of web archive

      *  Identification of source:

         +  Archived URI or identifier

         +  Archival timestamp

      *  Intended coverage (page, part, subsite etc.)

      The PWID URN represents this information in an unambiguous way,
      thus enabling technical solutions to be defined using this URN.

   Syntax:

      The syntax of the PWID URN is specified below in Augmented
      Backus-Naur Form (ABNF) [RFC5234] and it conforms to the URN
      syntax defined in RFC 8141 [RFC8141]. The syntax definition of
      the PWID URN is:

      pwid-urn = "urn" ":" pwid-NID ":" pwid-NSS

      pwid-NID = "pwid"
      pwid-NSS = archive-id ":" archival-time ":" coverage-spec
                 ":" archived-item

      archive-id = +( unreserved )

      archival-time = full-date datetime-delim full-pwid-time
      datetime-delim = "T"
      full-pwid-time = time-hour [":"] time-minute
                       [":"] time-second "Z"

      coverage-spec = "part" / "page" / "subsite" / "site"
                      / "collection" / "recording" / "snapshot"
                      / "other"

      archived-item = URI / archived-item-id
      archived-item-id = +( unreserved )

      where

      *  'unreserved' is defined as in RFC 3986 [RFC3986]

      *  'coverage-spec' values are not case sensitive (i.e. "PAGE" /
         "PART" / "PaGe" / ... are valid values as well.)

      *  'archival-time' is a UTC timestamp conforming to the W3C
         profile of ISO 8601 [ISO8601] (also defined in RFC 3339
         [RFC3339]), with a few exceptions. It has to be a UTC
         timestamp in order to conform with web archiving practices,
         which always use UTC in order to avoid confusion. The
         'full-date' is defined as in RFC 3339 [RFC3339]. The
         'archival-time' must represent the time specified in the
         archive, and can therefore be specified at any of the levels of
         granularity described in [W3CDTF] and in accordance with the
         WARC standard ISO 28500 [ISO28500].

         In line with RFC 3339 [RFC3339] the "T" may alternatively be
         lower case "t".

         'time-hour', 'time-minute' and 'time-second' are defined as in
         RFC 3339 [RFC3339].

         In line with RFC 3339 [RFC3339] the "Z" may alternatively be
         lower case "z".

      *  'URI' is defined as in RFC 3986 [RFC3986]

      The 'coverage-spec' defines the type of archived item, serving to
      make precise what is referred to:

      *  part
         the single archived element, e.g. a PDF, an HTML text, an image

      *  page
         the full context as a page, e.g. an HTML page with referred
         images

      *  subsite
         the full context as a subsite within its domain, e.g. a
         document represented in a web structure

      *  site
         the full context as a site within its domain

      *  collection
         a collection/corpus definition, e.g. defined as described in
         [ResawColl]

      *  snapshot
         a snapshot (image) representation of web material, e.g. a web
         page

      *  recording
         a recording of a web browsing session

      *  other
         if something else

   Assignment:

      PWID URNs do not have to be assigned by an authority, as they are
      based on the information created at the time of archiving:

      *  Identification of web archive

      *  Identification of source:

         +  Archived URI or identifier

         +  Archival timestamp

      *  Intended coverage (page, part, subsite etc.)

      The rest of the PWID URN,

      *  Intended coverage (page, part, subsite etc.)

      specifies what the user of the PWID URN wants to focus on, and may
      later be used to decide how a resource is displayed. However, it
      is not part of the actual location of the resource.

      In other words: the PWID URNs are created independently, but
      following an algorithm that itself guarantees uniqueness.

      In this version of the standard, it is recommended to use the web
      domain as the identifier for the web archive. This is
      recommended, since it currently implicitly provides information
      about the web archive. Furthermore, it is more precise than e.g.
      the name of the archive, since there may be more than one
      installation of web archives in the same organisation, e.g.
      archive.org and archive-it.org are both covered by Internet
      Archive.
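
      The assignment rule above, in which a PWID URN is composed from
      information that already exists at archiving time plus a chosen
      coverage, can be illustrated with a minimal Python sketch. It is
      not part of this specification. The function name build_pwid and
      its parameters are hypothetical, the coverage value is normalised
      to lower case as a convenience (the syntax allows any case), and
      the archived item is not validated as a URI.

```python
import re
from datetime import datetime, timezone

# Values listed in the 'coverage-spec' rule of the ABNF above.
COVERAGE_VALUES = {"part", "page", "subsite", "site",
                   "collection", "recording", "snapshot", "other"}

# 'unreserved' characters as defined in RFC 3986.
UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def build_pwid(archive_id: str, archived_at: datetime,
               coverage: str, archived_item: str) -> str:
    """Compose a PWID URN string from its four parts (hypothetical helper)."""
    if not UNRESERVED.fullmatch(archive_id):
        raise ValueError("archive-id must consist of unreserved characters")
    if coverage.lower() not in COVERAGE_VALUES:
        raise ValueError(f"unknown coverage-spec: {coverage!r}")
    if archived_at.tzinfo is None:
        raise ValueError("archival time must be timezone-aware")
    if not archived_item:
        raise ValueError("archived-item must not be empty")
    # The archival time is always written in UTC with a trailing "Z".
    stamp = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{stamp}:{coverage.lower()}:{archived_item}"


example = build_pwid(
    "archive.org",
    datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc),
    "page",
    "http://www.dr.dk",
)
print(example)
```

      Because every input is either recorded by the archive or chosen by
      the person making the reference, no central authority is involved
      in producing the string.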

      Currently, there is also a prototype for a SOLR-Wayback tool
      (source at https://github.com/netarchivesuite/solrwayback)
      [PWIDprovider], which can assist in finding the most precise
      reference to an archived web page by providing all PWIDs belonging
      to it. For example, the [web page] in archive: netarkivet.dk,
      with archived URI:
      http://www.susanlegetoej.dk/shop/handskedyr-siameser-killing-8681p.html
      and archiving time: 2008-11-29 01:19:16 UTC, has the parts:

      urn:pwid:netarkivet.dk:2008-11-29T00:41:42Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_Master_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:39:47Z:part:http://www.susanlegetoej.dk/shop/css/print.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:06Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_Basket_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:00Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_TopMenu_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:00Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_SearchPage_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:35Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_Productmenu_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:22Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_SpaceTop_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:24Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_SpaceLeft_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:23Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_SpaceBottom_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:40:25Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_SpaceRight_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:37:23Z:part:http://www.susanlegetoej.dk/images/ddcss/SK113_ProductInfo_NF.css

      urn:pwid:netarkivet.dk:2008-11-29T00:37:24Z:part:http://www.susanlegetoej.dk/Shop/js/Variants.js

      urn:pwid:netarkivet.dk:2009-03-03T11:53:00Z:part:http://www.susanlegetoej.dk/Shop/js/Media.js

      urn:pwid:netarkivet.dk:2009-03-03T11:53:02Z:part:http://www.susanlegetoej.dk/images/design/print.gif

      urn:pwid:netarkivet.dk:2009-03-03T11:54:19Z:part:http://www.susanlegetoej.dk/Shop/js/Scroll.js

      urn:pwid:netarkivet.dk:2009-03-03T11:54:09Z:part:http://www.susanlegetoej.dk/Shop/js/Shop5Common.js

      urn:pwid:netarkivet.dk:2006-11-20T20:16:03Z:part:http://www.susanlegetoej.dk/images/602551.jpg

      In the long term, a registry should be created that keeps track of
      identifiers of archives over time, since archives are likely to
      change names, merge etc. over a 100-year period.

   Security and Privacy:

      Security and privacy considerations are restricted to accessible
      web resources in web archives. If resolvers for PWID URNs are
      created, an analysis should be made of whether they can be
      restricted to the aforementioned registry of web archives.
      Security and privacy will then be a question of the security and
      privacy considerations related to the web archive resources.

   Interoperability:

      This is covered by comments in the Syntax description:

      *  the PWID URN conforms to the URI standard defined in RFC 3986
         [RFC3986] and the URN standard RFC 8141 [RFC8141]

      *  the 'archival-time' of the PWID URN conforms to the W3C profile
         of ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]) and
         to the WARC standard ISO 28500 [ISO28500], using UTC dates only

      *  the 'archived-item' is a URI which conforms to the URI standard
         defined in RFC 3986 [RFC3986]

   Resolution:

      The information in a PWID URN can be used for locating a web
      archive resource, for any kind of web archive. It includes the
      minimum information for web archive materials, which enables
      resolvability, manually or by a resolver.
      Resolution of a PWID URN is the primary motivation for making a
      formal URN definition, instead of just a textual representation of
      the four needed parts of a PWID.

      A resolving service is currently available in the form of code for
      a prototype which runs at the Royal Danish Library [PWIDresolver]
      and is planned to become more broadly available; it can also be
      installed locally. This service currently covers both the Danish
      web archives (with the proper rights) and open web archives with
      access services based on patterns including archive, archival time
      and archived URI. In other words, for open web archives it covers
      conversion of PWIDs for: archive.org, archive-it.org, arquivo.pt,
      bibalex.org, nationalarchives.gov.uk, stanford.edu and vefsafn.is.
      The source code for this prototype is available from
      https://github.com/netarchivesuite/NAS-research/releases/tag/0.0.6.

      Resolution (manually or automatically) is done based on the PWID
      parts:

      *  Web archive identification
         to find the archive holding the material

      *  Archived URI or identifier of item
         as part of identifying the material

      *  Date and time associated with the archived URI/item
         as part of precise identification of the material

      *  Coverage of what is referred
         as part of clarification of what the referred material covers
         (page, part etc.)
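
      The four parts listed above can be recovered mechanically from a
      PWID URN, which is what any resolver would need to do first. The
      following Python sketch is offered only as an illustration under
      stated assumptions. The names Pwid, PWID_PATTERN and parse_pwid
      are hypothetical, the regular expression follows the ABNF in the
      Syntax section (full date and time only), and the archived item is
      accepted as any non-empty string rather than validated as a URI.

```python
import re
from typing import NamedTuple

# Values listed in the 'coverage-spec' rule of the ABNF.
COVERAGE_VALUES = {"part", "page", "subsite", "site",
                   "collection", "recording", "snapshot", "other"}

# The archive-id and archival-time have a fixed shape, so the archived
# item (which may itself contain ":") is simply everything that remains.
PWID_PATTERN = re.compile(
    r"urn:pwid:"
    r"(?P<archive>[A-Za-z0-9._~-]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}[Tt]\d{2}:?\d{2}:?\d{2}[Zz]):"
    r"(?P<coverage>[A-Za-z]+):"
    r"(?P<item>.+)",
    re.IGNORECASE,  # "urn" and the namespace identifier are case-insensitive
)


class Pwid(NamedTuple):
    archive_id: str      # web archive identification
    archival_time: str   # date and time associated with the archived item
    coverage: str        # coverage of what is referred, lower-cased
    archived_item: str   # archived URI or identifier of item


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN into its four parts (hypothetical helper)."""
    match = PWID_PATTERN.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a PWID URN: {urn!r}")
    coverage = match["coverage"].lower()
    if coverage not in COVERAGE_VALUES:
        raise ValueError(f"unknown coverage-spec: {coverage!r}")
    return Pwid(match["archive"], match["time"], coverage, match["item"])


parts = parse_pwid(
    "urn:pwid:archive.org:2016-01-22T11:20:29Z:page:http://www.dr.dk"
)
print(parts)
```

      A resolver could then use archive_id to choose the archive and the
      remaining fields to build an archive-specific request; how that is
      done depends on each archive's access interface.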

      In the following, the different resolution techniques are
      explained (manual as well as via a service). For example, the
      PWID URN

      urn:pwid:archive.org:2016-01-22T11:20:29Z:page:http://www.dr.dk

      has the information:

      *  archive.org
         currently known identifier in the form of the Internet Archive
         domain name for their open access web archive

      *  2016-01-22T11:20:29Z
         UTC date and time associated with the archived URI

      *  page
         clarification that the reference covers the full web page with
         all its inherited parts selected by the web archive

      *  http://www.dr.dk
         archived URI of item

      With knowledge of the current (2017) Internet Archive open access
      web interface having the form:

      https://web.archive.org/web/