idnits 2.17.1

draft-pwid-urn-specification-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (May 2, 2019) is 1821 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                      Royal Danish Library
Intended status: Informational                               May 2, 2019
Expires: November 3, 2019

            A Persistent Web IDentifier (PWID) URN Namespace
                     draft-pwid-urn-specification-07

Abstract

   This document specifies a Uniform Resource Name (URN) for Persistent
   Web IDentifiers for web material in web archives using the 'pwid'
   namespace identifier.

   The main purpose of the standard is to support specification of
   references that are not covered by other reference techniques:
   references to material in web archives with restricted access.
   Furthermore, it supports persistent technology agnostic references
   to web archives in general, in a form that can work as an
   algorithmic basis for finding web archive resources.  An additional
   important benefit is that the standard can be used for specifying
   web collections, which can then form a persistent computational
   basis for extracting the archived collection parts.  Since these
   parts can be specified generally, collections can be specified with
   elements from one or more web archives.

   The PWID URN is designed to meet the requirements for proper
   referencing needed by researchers.  Therefore, it is designed to
   provide general, global, sustainable, human-readable, technology
   agnostic, persistent and precise web references for web materials in
   web archives.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 3, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Namespace Registration Template
   3.  Acknowledgements
   4.  References
     4.1.  Normative References
     4.2.  Informative References
   Author's Address

1.  Introduction

   The PWID URN supplements existing reference standards by supporting
   references to web archives, including an area that is not supported
   today: references to material in web archives with restricted
   access.  Furthermore, the PWID URN enables technology agnostic
   references to web archives in general, which can be needed, for
   instance, for references to dynamic web material with frequent
   updates (e.g. a news site) or a specific version of a web material
   (e.g.
   specific version of the DOI handbook).

   The PWID URN is in a form which can work as an algorithmic basis for
   finding the resource.  This also makes it possible to compute the
   archived web parts of a collection from one or more web archives, if
   the collection parts are specified by PWID URNs.

   Furthermore, the PWID URN includes information about the resource
   which makes it possible to find alternative resources, in cases where
   the original precise resource has become unavailable.

   The PWID URN is designed to be a persistent reference that is
   general, global and technology agnostic in order to enhance its
   chances of being sustainable.  Furthermore, it is designed to be
   human-readable and able to specify precisely what the referenced web
   archive resource covers.  This design enables a PWID URN to:

   o  be used in technical solutions, e.g. to make them resolvable

   o  cover references to all sorts of materials in web archives

   o  cover references to materials from all sorts of web archives

   The motivation for defining a PWID namespace is the growing challenge
   of referencing archived web resources; the PWID as a URN can help
   overcome many of these challenges.  The standard is needed to
   reference web materials with precision and persistency on par with
   traditional references to analogue material.  Furthermore, it is
   needed in order to address web archive resources that are not freely
   available online.  The PWID URN covers both referencing of web
   resources from research papers and definition of web
   collections/corpora.  In detail, the challenges are:

   o  Persistent Identifier systems (like DOI [DOI]) will only cover
      registered resources.  In general, citation guidelines do not
      cover general and persistent referencing techniques for web
      resources that are not registered.
      However, an increasing number of references point to resources
      that only exist on the web, e.g. blogs that turn out to have a
      historical impact.  In order to obtain persistency for a
      reference, the target needs to be stable.  For non-registered web
      resources, the common rule is that the resource will change, since
      the live web is constantly changing.  Persistency can only be
      obtained by referring to something stable, i.e. an archived
      snapshot of the resource from the web.  The PWID URN is therefore
      focused on referencing archived web material in a technology
      agnostic way (research documented in [IPRES2016] and [ResawRef]).

   o  References to materials which only exist in web archives (i.e. no
      longer on the live web) are not well supported, especially not for
      materials that only exist in archives with restricted access.
      There are many new initiatives for web archive referencing, most
      of which are centralized solutions offering harvesting and
      referencing, but these cannot be used for materials that only
      exist in web archives.  The PWID URN can be used for all web
      archives, including web archives with restricted access.

   o  One of the referencing initiatives for open web archives uses URLs
      which depend on the current setup of the web archive's access
      platform.  These URLs are usually technology and placement
      dependent, and therefore such a reference style is not suited for
      references that are important to retrace for a long period.  The
      PWID URN can be used for such reference purposes, since it is
      technology agnostic.

   o  Another referencing initiative, for open web archives, omits
      specification of the web archive where the resource was found.
      This strategy is used in order to open the possibility of using
      alternatives from other archives.
      However, this also adds a risk of imprecision, since different
      archives tend to have different versions even when harvesting at
      the same time.  Therefore, such a reference style is not suited
      for references where it is important that the referenced resource
      is precisely the one that was verified.  The PWID URN can provide
      an exact reference to where the resource was validated.
      Additionally, the PWID contains the information needed to search
      for alternative resources, if needed.

   o  For reference of web collections/corpora (possibly across
      different web archives), recent research has found that various
      legal and sustainability issues have led to a need for collection
      definitions consisting of references to their web parts.
      Furthermore, there is a need for similar persistent referencing of
      all parts for calculation and sustainability reasons.  So far,
      there has been no stable standard for definition of such
      collection parts.  The PWID URN can be used for such definitions
      in order to fulfil these requirements (research documented in
      [ResawColl]).

   The PWID URN is especially useful for web material where precision is
   in focus and/or there are references to materials from web archives
   requiring special permissions in order to gain access.  The precision
   regards both pointing to the archive where it was found and validated
   against its purpose (other archived versions in other web archives
   may differ both regarding completeness and contents even within short
   time periods) as well as precision in what is actually referred to by
   the reference (e.g. is it the page or the whole website).

   Furthermore, the PWID URN is very useful in specification of contents
   of a web collection.  Definitions of web collections are often needed
   for extraction of data used in production of research results, e.g.
   for future evaluations.
   Current practices are not persistent, as they often use some CDX
   version, which varies between implementations.

   Strict syntax is needed for the PWID URN in order to ensure that it
   can act as a reference which can be used for computational purposes.
   This is especially relevant for automatic extraction of parts from
   web collection definitions.  Furthermore, today's readers of research
   papers expect to be able to access a referenced resource by clicking
   an actionable URI; a similar possibility will therefore be expected
   for references to available archived web material, and this is
   possible with a strict syntax.  Examples of technical solutions that
   are enabled are:

   o  Resolving of a reference to a web collection and automatic
      extraction of the parts of a web collection defined by PWID URNs
      [ResawRef] [ResawColl]

   o  Resolving of a PWID URN by resolving services.  To begin with, a
      prototype has been developed for the Danish web archive data and
      open web archives with standard patterns for the current
      technologies.  Implementations for resolution of PWID URNs for
      other web archives may be developed.

   The purpose of the PWID URN is also to express a web archive
   reference as simply as possible and at the same time meet the
   requirements for sustainability, usability and scope.  Therefore, the
   PWID URN is focused on having only the minimum required information
   to make a precise identification of a resource in an arbitrary web
   archive.  Recent research has shown that this can be obtained by the
   following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended precision (page, part, subsite etc.)

   The PWID URN represents this information in a human-readable way as
   well as a well-defined way that enables technical solutions to
   interpret the URN.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Namespace Registration Template

   Namespace Identifier:

      PWID

   Version:

      1

   Date:

      2019-05-02

   Registrant:

      Eld Maj-Britt Olmuetz Zierau
      Royal Danish Library
      Soeren Kierkegaards Plads 1
      1219 Copenhagen
      Denmark
      ph: +45 9132 4690
      email: elzi@kb.dk

   Purpose:

      The PWID URN supplements existing reference standards by
      supporting references to web archives, including an area that is
      not supported today: references to material in web archives with
      restricted access.  Furthermore, the PWID URN enables technology
      agnostic references to web archives in general, which can be
      needed, for instance, for references to dynamic web material with
      frequent updates (e.g. a news site) or a specific version of a web
      material (e.g. specific version of the DOI handbook).

      The PWID URN is in a form which can work as an algorithmic basis
      for finding the resource.  This also makes it possible to compute
      the archived web parts of a collection from one or more web
      archives, if the collection parts are specified by PWID URNs.

      Furthermore, the PWID URN includes information about the resource
      which makes it possible to find alternative resources, in cases
      where the original precise resource has become unavailable.

      The PWID URN is designed to be a persistent reference that is
      general, global and technology agnostic in order to enhance its
      chances of being sustainable.  Furthermore, it is designed to be
      human-readable and able to specify precisely what the referenced
      web archive resource covers.  This design enables a PWID URN to:

      *  be used in technical solutions, e.g.
         to make them resolvable

      *  cover references to all sorts of materials in web archives

      *  cover references to materials from all sorts of web archives

      The motivation for defining a PWID namespace is the growing
      challenge of referencing archived web resources; the PWID as a URN
      can help overcome many of these challenges.  The standard is
      needed to reference web materials with precision and persistency
      on par with traditional references to analogue material.
      Furthermore, it is needed in order to address web archive
      resources that are not freely available online.  The PWID URN
      covers both referencing of web resources from research papers and
      definition of web collections/corpora.  In detail, the challenges
      are:

      *  Persistent Identifier systems (like DOI [DOI]) will only cover
         registered resources.  In general, citation guidelines do not
         cover general and persistent referencing techniques for web
         resources that are not registered.  However, an increasing
         number of references point to resources that only exist on the
         web, e.g. blogs that turn out to have a historical impact.  In
         order to obtain persistency for a reference, the target needs
         to be stable.  For non-registered web resources, the common
         rule is that the resource will change, since the live web is
         constantly changing.  Persistency can only be obtained by
         referring to something stable, i.e. an archived snapshot of the
         resource from the web.  The PWID URN is therefore focused on
         referencing archived web material in a technology agnostic way
         (research documented in [IPRES2016] and [ResawRef]).

      *  References to materials which only exist in web archives (i.e.
         no longer on the live web) are not well supported, especially
         not for materials that only exist in archives with restricted
         access.
         There are many new initiatives for web archive referencing,
         most of which are centralized solutions offering harvesting and
         referencing, but these cannot be used for materials that only
         exist in web archives.  The PWID URN can be used for all web
         archives, including web archives with restricted access.

      *  One of the referencing initiatives for open web archives uses
         URLs which depend on the current setup of the web archive's
         access platform.  These URLs are usually technology and
         placement dependent, and therefore such a reference style is
         not suited for references that are important to retrace for a
         long period.  The PWID URN can be used for such reference
         purposes, since it is technology agnostic.

      *  Another referencing initiative, for open web archives, omits
         specification of the web archive where the resource was found.
         This strategy is used in order to open the possibility of using
         alternatives from other archives.  However, this also adds a
         risk of imprecision, since different archives tend to have
         different versions even when harvesting at the same time.
         Therefore, such a reference style is not suited for references
         where it is important that the referenced resource is precisely
         the one that was verified.  The PWID URN can provide an exact
         reference to where the resource was validated.  Additionally,
         the PWID contains the information needed to search for
         alternative resources, if needed.

      *  For reference of web collections/corpora (possibly across
         different web archives), recent research has found that various
         legal and sustainability issues have led to a need for
         collection definitions consisting of references to their web
         parts.  Furthermore, there is a need for similar persistent
         referencing of all parts for calculation and sustainability
         reasons.  So far, there has been no stable standard for
         definition of such collection parts.
         The PWID URN can be used for such definitions in order to
         fulfil these requirements (research documented in [ResawColl]).

      The PWID URN is especially useful for web material where precision
      is in focus and/or there are references to materials from web
      archives requiring special permissions in order to gain access.
      The precision regards both pointing to the archive where it was
      found and validated against its purpose (other archived versions
      in other web archives may differ both regarding completeness and
      contents even within short time periods) as well as precision in
      what is actually referred to by the reference (e.g. is it the page
      or the whole website).

      Furthermore, the PWID URN is very useful in specification of
      contents of a web collection.  Definitions of web collections are
      often needed for extraction of data used in production of research
      results, e.g. for future evaluations.  Current practices are not
      persistent, as they often use some CDX version, which varies
      between implementations.

      Strict syntax is needed for the PWID URN in order to ensure that
      it can act as a reference which can be used for computational
      purposes.  This is especially relevant for automatic extraction of
      parts from web collection definitions.  Furthermore, today's
      readers of research papers expect to be able to access a
      referenced resource by clicking an actionable URI; a similar
      possibility will therefore be expected for references to available
      archived web material, and this is possible with a strict syntax.
      Examples of technical solutions that are enabled are:

      *  Resolving of a reference to a web collection and automatic
         extraction of the parts of a web collection defined by PWID
         URNs [ResawRef] [ResawColl]

      *  Resolving of a PWID URN by resolving services.
         To begin with, a prototype has been developed for the Danish
         web archive data and open web archives with standard patterns
         for the current technologies.  Implementations for resolution
         of PWID URNs for other web archives may be developed.

      The purpose of the PWID URN is also to express a web archive
      reference as simply as possible and at the same time meet the
      requirements for sustainability, usability and scope.  Therefore,
      the PWID URN is focused on having only the minimum required
      information to make a precise identification of a resource in an
      arbitrary web archive.  Recent research has shown that this can be
      obtained by the following information [ResawRef]:

      *  Identification of web archive

      *  Identification of source:

         +  Archived URI or identifier

         +  Archival timestamp

      *  Intended precision (page, part, subsite etc.)

      The PWID URN represents this information in a human-readable way
      as well as a well-defined way that enables technical solutions to
      interpret the URN.

   Syntax:

      The syntax of the PWID URN is specified below in Augmented
      Backus-Naur Form (ABNF) [RFC5234] and conforms to URN syntax
      defined in [RFC8141].  The syntax definition of the PWID URN is:

         pwid-urn = "urn:" pwid-NID ":" pwid-NSS

         pwid-NID = "pwid"
         pwid-NSS = archive-id ":" archival-time ":" precision-spec
                    ":" archived-item-id

         archive-id = domain / ( "~" registered-archive-id )
         registered-archive-id = 1*unreserved

         archival-time = utc-date ["T" utc-time] "Z"
         utc-date      = utc-year "-" utc-month "-" utc-day
         utc-year      = 4DIGIT
         utc-month     = 2DIGIT ; 01-12
         utc-day       = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on
                                ; month/year in UTC time
         utc-time      = utc-hour ":" utc-minute [":" utc-second [secfrac]]
         utc-hour      = 2DIGIT ; 00-23
         utc-minute    = 2DIGIT ; 00-59
         utc-second    = 2DIGIT ; 00-58, 00-59, 00-60 based on leap second
                                ; rules
         secfrac       = "." 1*9DIGIT

         precision-spec = "part" / "page" / "subsite" / "site"
                          / "collection" / "recording" / "snapshot"
                          / extension
         extension      = 1*ALPHA

         archived-item-id = uri-string / ( "~" registered-item-id )
         registered-item-id = 1*unreserved

      where

      *  All parts of the pwid-NSS are case insensitive, except for
         archived-item-id in cases where the archived-item-id is a URI
         with case sensitive parts.  According to [RFC8141] (section
         3.1) this means that PWID URNs in general are case insensitive,
         except in cases where they include a case sensitive URI as
         archived-item-id.

      *  'DIGIT' is defined as in [RFC5234].

      *  'ALPHA' is defined as in [RFC5234].

      *  'unreserved' is defined as in [RFC3986].

      *  'domain' is defined as in section 3.5 of [RFC1034].

      *  'uri-string' is defined as 'URI' in [RFC3986] but where
         occurrences of "[", "]", "?", "#" and "%" are %-encoded in
         order not to clash with URN reserved characters [RFC8141] as
         well as having unambiguous use of "%".

      *  'archive-id' must either be the domain for the archive, which
         can lead to descriptions of how to access (or apply for access
         to) materials in the archive, or it must be a registered
         archive-id (registry still to be defined and created).
         Distinction between the two types of identifiers is made by
         matching the first character with "~".  In case of a match, it
         means that the rest of the identifier is a registered archive
         identifier, since the syntax requires such identifiers to be
         prefixed with "~", while no domain is allowed to start with
         this character.

      *  'archival-time' is a UTC timestamp which conforms to the W3C
         profile of [ISO8601] [W3CDTF] and to a subset of the date-time
         format specified in [RFC3339] (except that partial time
         specification is allowed).
         The archival-time may be specified at any of the levels of
         granularity, as long as it reflects exactly the granularity of
         the timestamp recorded in the archive (which is in accordance
         with the WARC standard [ISO28500]).

      *  'archived-item-id' must either be the archived URI for the
         source or a registered archived-item-id.  Distinction between
         the two types of identifiers is made by matching the first
         character with "~".  In case of a match, it means that the rest
         of the identifier is a registered archived item identifier,
         since the syntax requires such identifiers to be prefixed with
         "~", while no URI is allowed to start with this character.

      The precision specification expresses the intended precision of
      the reference, which is needed for specification of:

      *  precise coverage of the reference
         e.g. to an html file, since the precise meaning of what the
         reference covers can vary widely (the html file itself? the web
         page it renders to? a subsite that it represents? etc.), or to
         a collection of web parts with precise specification of which
         web parts are included.

      *  degree of how precise the reference is with respect to what can
         be viewed in the future
         the html file itself will probably be the same, and a
         collection specification is also quite precise (see also the
         example of a collection provided in the section about
         assignment).  However, for web pages and websites there is
         interpretation involved, which means that the result of
         rendering them in the web archive can change over time.  This
         may happen in case the web archive's algorithm for calculation
         of the referred web parts has changed, or additional parts have
         been added which will be picked up by the algorithm.

      *  how resolvers should display the referred source in order to
         correspond to the precision specification
         if it is an html file it can e.g.
         be shown as a text file if the precision specification is
         'part', via Wayback if the precision specification is 'page',
         or by choosing a snapshot version of the page if the precision
         specification is 'snapshot', etc.  If access is limited to the
         referenced part (the html page), then the application would
         also need to make sure that all parts/pages belonging to the
         site/subsite are available.

      The following precision-spec values exist:

      *  'part'
         Meaning the single archived file/web part harvested from the
         specified URI.  In case the URI is for a file for a web page
         with html code (e.g. an html or asp file), this will mean the
         actual file with the html code.  It is relevant to refer to web
         pages this way in case the reference is part of a collection
         specification or in case it is the html that is of interest
         (e.g. javascripts or hidden links which are not visible when
         rendering the web page).
         For all other types of files, the URI will be for single files
         to be interpreted as a file.  The precision-spec will always be
         'part' for such single non-web related files.

      *  'page'
         Meaning that an application like Wayback calculates a resulting
         web page based on calculated referenced web parts (display
         templates, images etc.).  For example, an html page displaying
         an image will need both the html and the referred image.  If it
         is important for the reference to be sure that it is the same
         image, the most precise reference to a picture in the context
         of a web page would be to provide the PWID URN for the page
         (with 'page' precision) and the PWID URN for the image file
         part which contains the referred picture (with 'part'
         precision).

      *  'subsite'
         Meaning the subsite defined by the referred web page (as
         described under 'page') and underlying web pages (referred by
         the page) that have URIs starting with the same path as the
         archived URI.
         Calculation of all the relevant parts for the subsite is done
         by an application like Wayback (similarly to what is described
         under 'page').

      *  'site'
         Meaning the site defined by the referred web page (as described
         under 'page') and underlying web pages (referred by the page)
         that have URIs starting with the same path as the archived URI.
         Calculation of all the relevant parts for the site is done by
         an application like Wayback (similarly to what is described
         under 'page').

      *  'collection'
         Meaning the collection which is defined in a specification
         which can be identified by an identifier (e.g. collection
         specification in the XML format enabling interpretation as in
         the example provided in [ResawColl]).

      *  'snapshot'
         Meaning a snapshot (image) representation of web material, e.g.
         a web page.

      *  'recording'
         Meaning a representation of a web recording specification where
         the web archive applications will decide how it is rendered
         (interpretation could e.g. depend on the file suffix for the
         web recording).  An example is a web recording coded in a WARC
         file.

      The option of making an extension value is included to allow
      reference of a resource of any kind with an assigned identifier,
      even if it is not covered by the other values.  In all cases, it
      will be up to the application serving the web archive to interpret
      how this item should be rendered.

   Assignment:

      The PWID URNs do not have to be assigned by an authority, as they
      are based on the information created at the time of archiving.  In
      other words: a PWID URN is created independently, but following an
      algorithm which ensures that the referred item can be found if it
      is still available.  A PWID URN also has the benefit that it
      includes information that can be used to look for alternative
      resources, e.g.
      via Memento for some open web archives [MEMENTO] or via possible
      future web archive infrastructures.

      A PWID URN is created by finding the relevant information of the
      syntax parts of the PWID:

         "urn:pwid:" archive-id ":" archival-time ":" precision-spec
         ":" archived-item-id

      The PWID URN for an archived item at hand can be constructed by
      replacing the unspecified PWID parts with relevant information, as
      explained in the following:

      *  archive-id (identification of web archive):
         In this version of the standard, it is recommended to use the
         domain of the web archive as the identifier for the web archive
         (e.g. archive.org for Internet Archive's open web archive and
         netarkivet.dk for the Danish web archive with restricted
         access).  This is recommended, since browsing the domain page
         will typically lead to a description of how to access the web
         archive, e.g. by online access or by applying for access
         grants.  Furthermore, it is more precise than e.g. the name of
         the archive, since there may be more than one installation of
         web archives at the same organization, e.g. archive.org and
         archive-it.org are both covered by Internet Archive.
         When a registry of web archives is established, it will be more
         precise and persistent to use the web archive identifier
         specified in this registry (e.g. DKWA for the Danish web
         archive with the domain netarkivet.dk).  The syntax requires
         that such identifiers are prefixed with the character "~".

      *  archival-time (archival timestamp):
         The archival time for the archived item must be specified with
         as much granularity as possible in order to make sure it
         uniquely identifies the resource at hand.
The archival time 647 may be displayed along with the archived item, but 648 implementations differ, so it is important to be aware of 649 whether a more precise timestamp can be found, and whether the 650 correct timestamp is used. In many Wayback implementations, 651 the precise timestamp can be found as part of the URI used for 652 viewing the archived item. For example, in the archive HTTP URI 653 https://web.archive.org/web/20160122112029/http://www.dr.dk for 654 an archived resource viewable via the Internet Archive's 655 Wayback installation, the number 20160122112029 represents the 656 archival time 2016-01-22T11:20:29Z. In other installations, 657 the most precise timestamp may be found in the URI from a 658 search result leading to the resource (which usually redirects 659 on the basis of a call to the underlying archive index). 660 Especially for web pages with frames, there may be cases where 661 the actual time is not displayed with the source, since only 662 the times for the contents of the frames are displayed. 664 * precision-spec (precision as represented page, part, site, 665 snapshot etc.): 667 The precision specification specifies how the user should view 668 the referred item - either as a specific representation (with 669 inherited precision) or by use of tools (e.g. browse web site 670 based on calculations or browse on the basis of a collection of 671 specific parts). 672 Inherited precision is implicitly indicated by the precision 673 specification from how the information is used in resolution 674 and location. The most precise reference is part, e.g. for an 675 image which can be located and accessed independently. Less 676 precise references are references where calculation of other 677 parts is needed in order to resolve and view them, e.g. page, 678 site or subsite.
680 * archived-item-id (archived URI or registered identifier): 681 The archived item identifier will either be the archived URI of 682 the displayed archived item at hand, or it will be an 683 identifier assigned for a resource by the archive. In the 684 latter case, the syntax requires that such identifiers are 685 prefixed with the character "~". 687 A much easier way to construct PWID URNs is to use tools that 688 construct them. Currently, there is also a prototype for a SOLR- 689 Wayback tool (Source at https://github.com/netarchivesuite/ 690 solrwayback) [PWIDprovider], which can assist in finding the most 691 precise reference to an archived web page. This Wayback version 692 can provide all PWID URNs belonging to a shown page (with the page 693 PWID URN at the top). For example, in netarkivet.dk, the archived 694 URI for the web page http://www.susanlegetoej.dk/shop/handskedyr- 695 siameser-killing-8681p.html archived 2008-11-29 01:19:16 UTC, has 696 the following parts calculated by the SOLR-Wayback tool: 698 urn:pwid:netarkivet.dk:2008-11- 699 29T00:41:42Z:part:http://www.susanlegetoej.dk/images/ddcss/ 700 SK113_Master_NF.css 702 urn:pwid:netarkivet.dk:2008-11- 703 29T00:39:47Z:part:http://www.susanlegetoej.dk/shop/css/ 704 print.css 706 urn:pwid:netarkivet.dk:2008-11- 707 29T00:40:06Z:part:http://www.susanlegetoej.dk/images/ddcss/ 708 SK113_Basket_NF.css 710 urn:pwid:netarkivet.dk:2008-11- 711 29T00:40:00Z:part:http://www.susanlegetoej.dk/images/ddcss/ 712 SK113_TopMenu_NF.css 713 urn:pwid:netarkivet.dk:2008-11- 714 29T00:40:00Z:part:http://www.susanlegetoej.dk/images/ddcss/ 715 SK113_SearchPage_NF.css 717 urn:pwid:netarkivet.dk:2008-11- 718 29T00:40:35Z:part:http://www.susanlegetoej.dk/images/ddcss/ 719 SK113_Productmenu_NF.css 721 urn:pwid:netarkivet.dk:2008-11- 722 29T00:40:22Z:part:http://www.susanlegetoej.dk/images/ddcss/ 723 SK113_SpaceTop_NF.css 725 urn:pwid:netarkivet.dk:2008-11- 726 29T00:40:24Z:part:http://www.susanlegetoej.dk/images/ddcss/ 
727 SK113_SpaceLeft_NF.css 729 urn:pwid:netarkivet.dk:2008-11- 730 29T00:40:23Z:part:http://www.susanlegetoej.dk/images/ddcss/ 731 SK113_SpaceBottom_NF.css 733 urn:pwid:netarkivet.dk:2008-11- 734 29T00:40:25Z:part:http://www.susanlegetoej.dk/images/ddcss/ 735 SK113_SpaceRight_NF.css 737 urn:pwid:netarkivet.dk:2008-11- 738 29T00:37:23Z:part:http://www.susanlegetoej.dk/images/ddcss/ 739 SK113_ProductInfo_NF.css 741 urn:pwid:netarkivet.dk:2008-11- 742 29T00:37:24Z:part:http://www.susanlegetoej.dk/Shop/js/ 743 Variants.js 745 urn:pwid:netarkivet.dk:2009-03- 746 03T11:53:00Z:part:http://www.susanlegetoej.dk/Shop/js/Media.js 748 urn:pwid:netarkivet.dk:2009-03- 749 03T11:53:02Z:part:http://www.susanlegetoej.dk/images/design/ 750 print.gif 752 urn:pwid:netarkivet.dk:2009-03- 753 03T11:54:19Z:part:http://www.susanlegetoej.dk/Shop/js/Scroll.js 755 urn:pwid:netarkivet.dk:2009-03- 756 03T11:54:09Z:part:http://www.susanlegetoej.dk/Shop/js/ 757 Shop5Common.js 759 urn:pwid:netarkivet.dk:2006-11- 760 20T20:16:03Z:part:http://www.susanlegetoej.dk/images/602551.jpg 762 Security and Privacy: 764 Security and privacy considerations are restricted to accessible 765 web resources in web archives. Resolution of PWID URNs will 766 usually only be possible using the web archives' access tools, 767 in which case security and privacy are covered by these 768 tools. It 769 should be noted that an archived web page or part can be just as 770 dangerous as a "live" page or part; for instance, it could include 771 insecure scripts, malware, trackers, etc. Furthermore, an 772 archived page can in fact be more dangerous, because it could 773 include outdated scripts with known vulnerabilities that can never 774 be patched because the script is archived for all time in a 775 vulnerable state.
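As an illustration of the construction steps described under Assignment above, the following is a minimal sketch (not part of the specification) that derives a PWID URN from a Wayback-style viewing URI. The function names and the regular expression are hypothetical. The sketch assumes that the viewing URI has the form ".../web/" followed by a 14-digit timestamp, "/" and the archived URI, and it reads the %-encoding rule listed under Interoperability as "encode '%' first, then '[', ']', '#' and '?'", which is one possible interpretation.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style pattern: <scheme>://<host>/web/<14 digits>/<archived URI>.
# Timestamps with modifiers (e.g. a trailing "id_") are not handled in this sketch.
WAYBACK_RE = re.compile(r"^https?://[^/]+/web/(\d{14})/(.+)$")


def wayback_timestamp_to_utc(timestamp: str) -> str:
    """Convert e.g. '20160122112029' to '2016-01-22T11:20:29Z'."""
    moment = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def encode_archived_uri(uri: str) -> str:
    """%-encode '%', '[', ']', '#' and '?' (assumed reading of the rule)."""
    encoded = uri.replace("%", "%25")  # '%' first, so later escapes stay intact
    for char, escape in (("[", "%5B"), ("]", "%5D"), ("#", "%23"), ("?", "%3F")):
        encoded = encoded.replace(char, escape)
    return encoded


def build_pwid(archive_id: str, archival_time: str, precision: str, archived_uri: str) -> str:
    """Join the four PWID parts in the order given by the syntax."""
    return ":".join(
        ["urn", "pwid", archive_id, archival_time, precision, encode_archived_uri(archived_uri)]
    )


def pwid_from_wayback_uri(view_uri: str, archive_id: str, precision: str = "page") -> str:
    """Extract timestamp and archived URI from a viewing URI and build the URN."""
    match = WAYBACK_RE.match(view_uri)
    if match is None:
        raise ValueError("not a Wayback-style viewing URI: " + view_uri)
    timestamp, archived_uri = match.groups()
    return build_pwid(archive_id, wayback_timestamp_to_utc(timestamp), precision, archived_uri)


if __name__ == "__main__":
    print(pwid_from_wayback_uri(
        "https://web.archive.org/web/20160122112029/http://www.dr.dk", "archive.org"))
```

The archive-id is passed in explicitly rather than derived from the host of the viewing URI, since the recommended identifier (e.g. archive.org) need not equal that host (e.g. web.archive.org).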
777 Interoperability: 779 This is covered by comments in the Syntax description: 781 * the PWID URN conforms to the URI standard as defined in 782 [RFC3986] and the URN standard [RFC8141] 784 * the 'archival-time' of the PWID URN conforms to the UTC 785 timestamp as described in the W3C profile of ISO 8601 [ISO8601] 786 [W3CDTF] and is in accordance with the WARC standard ISO 28500 787 [ISO28500]. 789 * when a URI is used as the 'archived-item-id', this URI conforms 790 to the URI standard as defined in [RFC3986], with %-encodings 791 of "[", "]", "#", "?" and "%" in order to conform to the URN 792 standard [RFC8141] as well as having unambiguous use of "%" 794 Resolution: 796 The information in a PWID URN can be used for locating a web 797 archive resource, for any kind of web archive. It includes the 798 minimum information for web archive materials, which enables 799 resolvability, manually or by a resolver. Resolution of a PWID 800 URN is the primary motivation for making a formal URN definition, 801 instead of just a textual representation of the four needed parts of 802 a PWID. 804 Resolution (manually or automatically) is done based on the PWID 805 parts: 807 * Web archive identification for web archive holding referred 808 resource 809 If the identifier does not start with "~", then the identifier is 810 the domain name for the web archive, where browsing this domain 811 page typically will lead to a description of how to access the 812 web archive. For example, "archive.org" is the domain name 813 leading to the Internet Archive's interface to their online web 814 collection, and "netarkivet.dk" is the domain name leading to 815 the website for the Danish web archive with information about 816 how to apply for access permission to the web collections. If 817 the identifier starts with "~", the archive can be identified by 818 looking up the identifier (from the rest of the archive 819 identifier) in a registry of archives.
It is a future 820 possibility to have such a registry, which should hold 821 archive identifiers along with their current location on the 822 internet. Such a registry will be needed for persistent 823 reference to the archive, since an archive may change its 824 location and name, or archives may merge. There is work in 825 progress to define such a registry, but no details yet. 827 * Archived URI or registered identifier of archived item 828 If the identifier does not start with "~", then the identifier is 829 an archived URI, and this URI must be used in search for or 830 construction of the location of the resource. If the identifier 831 starts with "~", then the rest of the characters in the 832 identifier constitute a registered identifier assigned to the 833 resource (by the archive), and it is then this identifier that must 834 be used in search for or construction of the location of the 835 resource 837 * Date and time associated with the archived item 838 The archival date and time must be used in search for or 839 construction of the location of the resource 841 * Precision of what is referred 842 The precision can contribute to the guidance of 843 activating tools to view the referred item, e.g. browse the 844 referred item as a page on the basis of computed closest past, 845 browse the referred item on the basis of parts specified in a 846 collection, or view the referred item as a snapshot. In the 847 example of the snapshot, it also contains a specification of 848 which resource to display 850 In the following the different resolution techniques are explained 851 (manual as well as via a service). 853 An example of a PWID URN is: 855 urn:pwid:archive.org:2016-01-22T11:20:29Z:page:http://www.dr.dk 857 It has the information: 859 * archive.org 860 Currently known identifier in form of the Internet Archive 861 domain name for their open access web archive.
If Internet 862 Archive registered their open web archive in an IANA web 863 archive registry, this identifier could currently be 864 "web.archive.org/web/" for Wayback resolution, or it could be 865 "archive.org/pwid/" if a PWID interface was created as 866 described below 868 * 2016-01-22T11:20:29Z 869 UTC date and time associated with the archived URI 871 * page 872 Clarification that the reference covers the full web page with 873 all its inherited parts selected by the web archive 875 * http://www.dr.dk 876 archived URI of item 878 Resolution of this PWID URN can be deduced based on the current 879 (2019) knowledge of Internet Archive's open Wayback access web 880 interface, which has the pattern: 882 https://web.archive.org/web/