Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Intended status: Informational                                 E. Bermes
Expires: August 25, 2021                Bibliotheque nationale de France
                                                       February 21, 2021

                       The ARK Identifier Scheme
                           draft-kunze-ark-27

Abstract

   The ARK (Archival Resource Key) naming scheme is designed to
   facilitate the high-quality and persistent identification of
   information objects. A founding principle of the ARK is that
   persistence is purely a matter of service and is neither inherent in
   an object nor conferred on it by a particular naming syntax. The
   best that an identifier can do is to lead users to the services that
   support robust reference. The term ARK itself refers both to the
   scheme and to any single identifier that conforms to it.
   An ARK has five components:

      [https://NMA/]ark:[/]NAAN/Name[Qualifier]

   an optional and mutable Name Mapping Authority (usually a hostname),
   the "ark:" label, the Name Assigning Authority Number (NAAN), the
   assigned Name, and an optional and possibly mutable Qualifier
   supported by the NMA. The NAAN and Name together form the immutable
   persistent identifier for the object independent of the URL hostname.
   An ARK is a special kind of URL that connects users to three things:
   the named object, its metadata, and the provider's promise about its
   persistence. When entered into the location field of a Web browser,
   the ARK leads the user to the named object. That same ARK, inflected
   by appending `?info', returns a metadata record that is both human-
   and machine-readable. The returned record contains core metadata and
   a commitment statement from the current provider. Tools exist for
   minting, binding, and resolving ARKs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 25, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Reasons to Use ARKs
     1.2.  Three Requirements of ARKs
     1.3.  Organizing Support for ARKs: Our Stuff vs. Their Stuff
     1.4.  Definition of Identifier
   2.  ARK Anatomy
     2.1.  The Name Mapping Authority (NMA)
     2.2.  The ARK Label Part (ark:)
     2.3.  The Name Assigning Authority Number (NAAN)
     2.4.  The Name Part
       2.4.1.  Optional: Shoulder and Blade
     2.5.  The Qualifier Part
       2.5.1.  ARKs that Reveal Object Hierarchy
       2.5.2.  ARKs that Reveal Object Variants
     2.6.  Character Repertoires
     2.7.  Normalization and Lexical Equivalence
   3.  Naming Considerations
     3.1.  ARKS Embedded in Language
     3.2.  Objects Should Wear Their Identifiers
     3.3.  Names are Political, not Technological
     3.4.  Choosing a Hostname or NMA
     3.5.  Assigners of ARKs
     3.6.  NAAN Namespace Management
     3.7.  Sub-Object Naming
   4.  Finding a Name Mapping Authority
     4.1.  Looking Up NMAs in a Globally Accessible File
   5.  Generic ARK Service Definition
     5.1.  Generic ARK Access Service (access, location)
       5.1.1.  Generic Policy Service (permanence, naming, etc.)
       5.1.2.  Generic Description Service
     5.2.  Overview of The HTTP URL Mapping Protocol (THUMP)
     5.3.  The Electronic Resource Citation (ERC)
     5.4.  Advice to Web Clients
     5.5.  Security Considerations
   6.  References
   Appendix A.  ARK Maintenance Agency: arks.org
   Appendix B.  Looking up NMAs Distributed via DNS
   Authors' Addresses

1.  Introduction

   [ Note about this transitional draft. The ARKsInTheOpen.org
   Technical Working Group
   (https://wiki.duraspace.org/display/ARKs/Technical+Working+Group)
   is in the process of revising the ARK spec
   via a series of Internet-Drafts. This draft contains many minor but
   noisy changes (lots of diffs but not much real change). While the
   spec is in transition, new implementors should follow
   https://datatracker.ietf.org/doc/html/draft-kunze-ark-18. ]

   This document describes a scheme for the high-quality naming of
   information resources. The scheme, called the Archival Resource Key
   (ARK), is well suited to long-term access and identification of any
   information resources that accommodate reasonably regular electronic
   description. This includes digital documents, databases, software,
   and websites, as well as physical objects (books, bones, statues,
   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
   performances).
   Hereafter the term "object" refers to an information
   resource. The term ARK itself refers both to the scheme and to any
   single identifier that conforms to it. A reasonably concise and
   accessible overview and rationale for the scheme is available at
   [ARK].

   Schemes for persistent identification of network-accessible objects
   are not new. In the early 1990's, the design of the Uniform Resource
   Name [RFC2141] responded to the observed failure rate of URLs by
   articulating an indirect, non-hostname-based naming scheme and the
   need for responsible name management. Meanwhile, promoters of the
   Digital Object Identifier [DOI] succeeded in building a community of
   providers around a mature software system [Handle] that supports name
   management. The Persistent Uniform Resource Locator [PURL] was
   another scheme that had the advantage of working with unmodified web
   browsers. ARKs represent an approach that attempts to build on the
   strengths and to avoid the weaknesses of these schemes.

   A founding principle of the ARK is that persistence is purely a
   matter of service. Persistence is neither inherent in an object nor
   conferred on it by a particular naming syntax. Nor is the technique
   of name indirection -- upon which URNs, Handles, DOIs, and PURLs are
   founded -- of central importance. Name indirection is an ancient and
   well-understood practice; new mechanisms for it keep appearing and
   distracting practitioner attention, with the Domain Name System (DNS)
   [RFC1034] being a particularly dazzling and elegant example. What is
   often forgotten is that maintenance of an indirection table is an
   unavoidable cost to the organization providing persistence, and that
   cost is equivalent across naming schemes.
   That indirection has
   always been a native part of the web while being so lightly utilized
   for the persistence of web-based objects indicates how unsuited most
   organizations will probably be to the task of table maintenance and
   to the much more fundamental challenge of keeping the objects
   themselves viable.

   Persistence is achieved through a provider's successful stewardship
   of objects and their identifiers. The highest level of persistence
   will be reinforced by a provider's robust contingency, redundancy,
   and succession strategies. It is further safeguarded to the extent
   that a provider's mission is shielded from funding and political
   instabilities. These are by far the major challenges confronting
   persistence providers, and no identifier scheme has any direct impact
   on them. In fact, some schemes may actually be liabilities for
   persistence because they create short- and long-term dependencies for
   every object access on complex, special-purpose infrastructures,
   parts of which are proprietary and all of which increase the
   carry-forward burden for the preservation community. It is for this
   reason that the ARK scheme relies only on educated name assignment
   and light use of general-purpose infrastructures that are maintained
   mostly by the internet community at large (the DNS, web servers, and
   web browsers).

1.1.  Reasons to Use ARKs

   If no persistent identifier scheme contributes directly to
   persistence, why not just use URLs? A particular URL may be as
   durable an identifier as it is possible to have, but nothing
   distinguishes it from an ordinary URL to the recipient who is
   wondering if it is suitable for long-term reference.
   An ARK embedded
   in a URL provides some of the necessary conditions for credible
   persistence, inviting access to not one, but to three things: to the
   object, to its metadata, and to a nuanced statement of commitment
   from the provider in question (the NMA, described below) regarding
   the object. Existence of the extra service can be probed
   automatically by appending `?info' to the ARK.

   The form of the ARK also supports the natural separation of naming
   authorities into the original name assigning authority and the
   diverse multiple name mapping (or servicing) authorities that in
   succession and in parallel will take over custodial responsibilities
   from the original assigner (assuming the assigner ever held that
   responsibility) for the large majority of a long-term object's
   archival lifetime. The name mapping authority, indicated by the
   hostname part of the URL that contains the ARK, serves to launch the
   ARK into cyberspace. Should it ever fail (and there is no reason why
   a well-chosen hostname for a 100-year-old cultural memory institution
   shouldn't last as long as the DNS), that hostname is considered
   disposable and replaceable. Again, the form of the ARK helps
   because it defines exactly how to recover the core immutable object
   identity, and simple algorithms (one based on the URN model) or even
   by-hand internet query can be used for locating another mapping
   authority.

   There are tools to assist in generating ARKs and other identifiers,
   such as [NOID] and "uuidgen", both of which rely for uniqueness on
   human-maintained registries. This document also contains some
   guidelines and considerations for managing namespaces and choosing
   hostnames with persistence in mind.

1.2.  Three Requirements of ARKs

   The first requirement of an ARK is to give users a link from an
   object to a promise of stewardship for it.
   That promise is a
   multi-faceted covenant that binds the word of an identified service
   provider to a specific set of responsibilities. It is critical for
   the promise to come from a current provider and almost irrelevant,
   over a long period of time, what the original assigner's intentions
   were. No one can tell if successful stewardship will take place
   because no one can predict the future. Reasonable conjecture,
   however, may be based on past performance. There must be a way to
   tie a promise of persistence to a provider's demonstrated or
   perceived ability -- its reputation -- in that arena. Provider
   reputations would then rise and fall as promises are observed
   variously to be kept and broken. This is perhaps the best way we
   have for gauging the strength of any persistence promise.

   The second requirement of an ARK is to give users a link from an
   object to a description of it. The problem with a naked identifier
   is that without a description real identification is incomplete.
   Identifiers common today are relatively opaque, though some contain
   ad hoc clues reflecting assertions that were briefly true, such as
   where in a filesystem hierarchy an object lived during a short stay.
   Possession of both an identifier and an object is some improvement,
   but positive identification may still be uncertain since the object
   itself might not include a matching identifier or might not carry
   evidence obvious enough to reveal its identity without significant
   research. In either case, what is called for is a record bearing
   witness to the identifier's association with the object, as supported
   by a recorded set of object characteristics. This descriptive record
   is partly an identification "receipt" with which users and archivists
   can verify an object's identity after brief inspection and a
   plausible match with recorded characteristics such as title and size.

   The final requirement of an ARK is to give users a link to the object
   itself (or to a copy) if at all possible. Persistent identification
   plays a vital supporting role but, strictly speaking, it can be
   construed as no more than a record attesting to the original
   assignment of a never-reassigned identifier. Object access may not
   be feasible for various reasons, such as a transient service outage,
   a catastrophic loss, a licensing agreement that keeps an archive
   "dark" for a period of years, or when an object's own lack of
   tangible existence confuses normal concepts of access (e.g., a
   vocabulary term might be "accessed" through its definition). In such
   cases the ARK's identification role assumes a much higher profile.
   But attempts to simplify the persistence problem by decoupling access
   from identification and concentrating exclusively on the latter are
   of questionable utility. A perfect system for assigning forever
   unique identifiers might be created, but if it did so without
   reducing access failure rates, no one would be interested. The
   central issue -- which may be crudely summed up as the "HTTP 404 Not
   Found" problem -- would not have been addressed.

   The central duty of an ARK is a high-quality experience of access and
   identification. This means supporting reliable access during the
   period described in its stewardship promise and, failing that,
   supporting reliable access to a record describing the thing the ARK
   is associated with.

   ARK resolvers must support the `?info' inflection for requesting
   metadata. Older versions of this specification distinguished between
   two minimal inflections: `?' (brief metadata) and `??' (more
   metadata). While these older inflections are still reserved, because
   they have proven hard to recognize in some environments, supporting
   them is optional.

1.3.  Organizing Support for ARKs: Our Stuff vs. Their Stuff

   An organization and the user community it serves can often be seen to
   struggle with two different areas of persistent identification: the
   Our Stuff problem and the Their Stuff problem. In the Our Stuff
   problem, we in the organization want our own objects to acquire
   persistent names. Since we possess or control these objects, our
   organization tackles the Our Stuff problem directly. Whether or not
   the objects are named by ARKs, our organization is the responsible
   party, so it can plan for, maintain, and make commitments about the
   objects.

   In the Their Stuff problem, we in the organization want others'
   objects to acquire persistent names. These are objects that we do
   not own or control, but some of which are critically important to us.
   But because they are beyond our influence as far as support is
   concerned, creating and maintaining persistent identifiers for Their
   Stuff is not especially purposeful or feasible for us to engage in.
   There is little that we can do about someone else's stuff except
   encourage their uptake or adoption of persistence services.

   Co-location of persistent access and identification services is
   natural. Any organization that undertakes ongoing support of true
   persistent identification (which includes description) is well-served
   if it controls, owns, or otherwise has clear internal access to the
   identified objects, and this gives it an advantage if it wishes also
   to support persistent access to outsiders. Conversely, persistent
   access to outsiders requires orderly internal collection management
   procedures that include monitoring, acquisition, verification, and
   change control over objects, which in turn requires object
   identifiers persistent enough to support auditable record keeping
   practices.

   Although organizing ARK support under one roof thus tends to make
   sense, object hosting can successfully be separated from name
   mapping. An example is when a name mapping authority centrally
   provides uniform resolution services via a protocol gateway on behalf
   of organizations that host objects behind a variety of access
   protocols. It is also reasonable to build value-added description
   services that rely on the underlying services of a set of mapping
   authorities.

   Supporting ARKs is not for every organization. By requiring
   specific, revealed commitments to preservation, to object access, and
   to description, the bar for providing ARK services is higher than for
   some other identifier schemes. On the other hand, it would be hard
   to grant credence to a persistence promise from an organization that
   could not muster the minimum ARK services. Not that there isn't a
   business model for an ARK-like, description-only service built on top
   of another organization's full complement of ARK services. For
   example, there might be competition at the description level for
   abstracting and indexing a body of scientific literature archived in
   a combination of open and fee-based repositories. The
   description-only service would have no direct commitment to the
   objects, but would act as an intermediary, forwarding commitment
   statements from object hosting services to requestors.

1.4.  Definition of Identifier

   An identifier is not a string of character data -- an identifier is
   an association between a string of data and an object. This
   abstraction is necessary because without it a string is just data.
   It's nonsense to talk about a string's breaking, or about its being
   strong, maintained, and authentic. But as a representative of an
   association, a string can do, metaphorically, the things that we
   expect of it.

   Without regard to whether an object is physical, digital, or
   conceptual, to identify it is to claim an association between it and
   a representative string, such as "Jane" or "ISBN 0596000278". What
   gives a claim credibility is a set of verifiable assertions, or
   metadata, about the object, such as age, height, title, or number of
   pages. In other words, the association is made manifest by a record
   (e.g., a cataloging or other metadata record) that vouches for it.

   In the complete absence of any testimony (metadata) regarding an
   association, a would-be identifier string is a meaningless sequence
   of characters. To keep an externally visible but otherwise internal
   string from being perceived as an identifier by outsiders, for
   example, it suffices for an organization not to disclose the nature
   of its association. For our immediate purpose, actual existence of
   an association record is more important than its authenticity or
   verifiability, which are outside the scope of this specification.

   It is a gift to the identification process if an object carries its
   own name as an inseparable part of itself, such as an identifier
   imprinted on the first page of a document or embedded in a data
   structure element of a digital document header. In cases where the
   object is large, unwieldy, or unavailable (such as when licensing
   restrictions are in effect), a metadata record that includes the
   identifier string will usually suffice. That record becomes a
   conveniently manipulable object surrogate, acting as both an
   association "receipt" and "declaration".

   Note that our definition of identifier extends the one in use for
   Uniform Resource Identifiers [RFC3986]. The present document still
   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
   the string part of an identifier, but the context should make the
   meaning clear.

2.  ARK Anatomy

   An ARK is represented by a sequence of characters (a string) that
   contains the label, "ark:", optionally preceded by the beginning part
   of a URL. Here is a diagrammed example.

                          ARK ANATOMY
                          ===========

       Resolver Service   Base Object Name   Qualifiers
      __________________  ________________  _____________
     /                  \/                \/             \
     https://example.org/ark:12345/x54xz321/s3/f8.05v.tiff
             \_________/ \__/\___/\_/\____/\____/\_______/
                  |     Label  |   | Blade    |      |
                  |            |   |          |      |
  Name Mapping Authority (NMA) | Shoulder Sub-parts Variants
                               |  \_______/
                               |  Assigned Base Name
                               |
            Name Assigning Authority Number (NAAN)

   The ARK syntax can be summarized,

      [https://NMA/]ark:[/]NAAN/Name[Qualifier]

   where the NMA, '/', and Qualifier parts are in brackets to indicate
   that they are optional. The Base Object Name is the substring
   comprising the "ark:" label, the NAAN and the assigned Name. The
   Resolver Service is replaceable and makes the ARK actionable for a
   period of time. Without the Resolver Service part, what remains is
   the Core Immutable Identity (the "persistible") part of the ARK.

2.1.  The Name Mapping Authority (NMA)

   Before the "ark:" label may appear an optional Name Mapping Authority
   (NMA) that is a temporary address where ARK service requests may be
   sent. Preceded by a URI-type protocol designation such as
   "https://", it specifies a Resolver Service. The NMA itself is an
   Internet hostname or host/port combination having the same format and
   semantics as the host/port part of a URL. The most important thing
   about the NMA is that it is "identity inert" from the point of view
   of object identification. In other words, ARKs that differ only in
   the optional NMA part identify the same object.
   Thus, for example,
   the following three ARKs are synonyms for just one information
   object:

      https://loc.gov/ark:12345/x54xz321
      https://rutgers.edu/ark:12345/x54xz321
      ark:12345/x54xz321

   Strictly speaking, in the realm of digital objects, these ARKs may
   lead over time to somewhat different or diverging instances of the
   originally named object. In an ideal world, divergence of persistent
   objects is not desirable, but it is widely believed that digital
   preservation efforts will inevitably lead to alterations in some
   original objects (e.g., a format migration in order to preserve the
   ability to display a document). If any of those objects are held
   redundantly in more than one organization (a common preservation
   strategy), chances are small that all holding organizations will
   perform the same precise transformations and all maintain the same
   object metadata. More significant divergence would be expected when
   the holding organizations serve different audiences or compete with
   each other.

   The NMA part makes an ARK into an actionable URL. As with many
   internet parameters, it is helpful to approach the NMA being liberal
   in what you accept and conservative in what you propose. From the
   recipient's point of view, the NMA part should be treated as
   temporary, disposable, and replaceable. From the NMA's point of
   view, it should be chosen with the greatest concern for longevity. A
   carefully chosen NMA should be at least as permanent as the providing
   organization's own hostname. In the case of a national or university
   library, for example, there is no reason why the NMA should not be
   considerably more permanent than soft-funded proxy hostnames such as
   hdl.handle.net, dx.doi.org, and purl.org.
   In general and over time,
   however, it is not unexpected for an NMA eventually to stop working
   and require replacement with the NMA of a currently active service
   provider.

   This replacement relies on a mapping authority "resolver" discovery
   process, of which two alternate methods are outlined in a later
   section. The ARK, URN, Handle, and DOI schemes all use a resolver
   discovery model that sooner or later requires matching the original
   assigning authority with a current provider servicing that
   authority's named objects; once found, the resolver at that provider
   performs what amounts to a redirect to a place where the object is
   currently held. All the schemes rely on the ongoing functionality of
   currently mainstream technologies such as the Domain Name System
   [RFC1034] and web browsers. The Handle and DOI schemes in addition
   require that the Handle protocol layer and global server grid be
   available at all times.

   The practice of prepending "https://" and an NMA to an ARK is a way
   of creating an actionable identifier by a method that is itself
   temporary. Assuming that infrastructure supporting [RFC2616]
   information retrieval will no longer be available one day, ARKs will
   then have to be converted into new kinds of actionable identifiers.
   By that time, if ARKs see widespread use, web browsers would
   presumably evolve to perform this (currently simple) transformation
   automatically.

2.2.  The ARK Label Part (ark:)

   The label part distinguishes an ARK from an ordinary identifier.
   There is a new form of the label, "ark:", and an old form, "ark:/",
   both of which must be recognized in perpetuity. Implementations
   should generate new ARKs in the new form (without the "/") and
   resolvers must always treat received ARKs as equivalent if they
   differ only in regard to new form versus old form labels.
   Thus these
   two ARKs are equivalent:

      ark:/12345/x54xz321
      ark:12345/x54xz321

   In a URL found in the wild, the label indicates that the URL stands a
   reasonable chance of being an ARK. If the context warrants,
   verification that it actually is an ARK can be done by testing it for
   existence of the three ARK services.

   Since nothing about an identifier syntax directly affects
   persistence, the "ark:" label (like "urn:", "doi:", and "hdl:")
   cannot tell you whether the identifier is persistent or whether the
   object is available. It does tell you that the original Name
   Assigning Authority (NAA) had some sort of hopes for it, but it
   doesn't tell you whether that NAA is still in existence, or whether a
   decade ago it ceased to have any responsibility for providing
   persistence, or whether it ever had any responsibility beyond naming.

   Only a current provider can say for certain what sort of commitment
   it intends, and the ARK label suggests that you can query the NMA
   directly to find out exactly what kind of persistence is promised.
   Even if what is promised is impersistence (i.e., a short-term
   identifier), saying so is valuable information to the recipient.
   Thus an ARK is a high-functioning identifier in the sense that it
   provides access to the object, the metadata, and a commitment
   statement, even if the commitment is explicitly very weak.

2.3.  The Name Assigning Authority Number (NAAN)

   Recalling that the general form of the ARK is,

      [https://NMA/]ark:[/]NAAN/Name[Qualifier]

   the part of the ARK directly following the "ark:" (or older "ark:/")
   label is the Name Assigning Authority Number (NAAN), up to but not
   including the next `/' (slash) character. This part is always
   required, as it identifies the organization that originally assigned
   the Name of the object.
Typically the 530 organization is an institution, a department, a laboratory, or any 531 group that conducts a stable, policy-driven name assigning effort. 532 It is used to discover a currently valid NMA and to provide top-level 533 partitioning of the space of all ARKs. 535 An organization may request a NAAN from the ARK Maintenance Agency 536 [ARKagency] (described in Appendix A) by filling out the form at 537 [NAANrequest]. NAANs are opaque strings of one or more "betanumeric" 538 characters, specifically, 540 0123456789bcdfghjkmnpqrstvwxz 542 which consists of digits and consonants, minus the letter 'l'. 543 Restricting NAANs to betanumerics (alphanumerics without vowels or 544 'l') serves two goals. It reduces the chances that words -- past, 545 present, and future -- will appear in NAANs and carry unintended 546 semantics. It also helps usability by not mixing commonly confused 547 characters ('0' and 'O', '1' and 'l') and by being compatible with 548 strong transcription error detection (eg, the [NOID] check digit 549 algorithm). Since 2001, every assigned NAAN has consisted of exactly 550 five digits. 552 The NAAN designates a top-level ARK namespace. Once registered for a 553 namespace, a NAAN is never re-registered. It is possible, however, 554 for there to be a succession of organizations that manage an ARK 555 namespace. 557 2.4. The Name Part 559 The part of the ARK just after the NAAN is the Name assigned by the 560 NAA, and it is also required. Semantic opaqueness in the Name part 561 is strongly encouraged in order to reduce an ARK's vulnerability to 562 era- and language-specific change. Identifier strings containing 563 linguistic fragments can create support difficulties down the road. 564 No matter how appropriate or even meaningless they are today, such 565 fragments may one day create confusion, give offense, or infringe on 566 a trademark as the semantic environment around us and our communities 567 evolves. 
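As an illustration of the betanumeric restriction on NAANs described in Section 2.3, the following minimal Python sketch checks whether a candidate string uses only the permitted repertoire. The function name is hypothetical and not part of this specification; the check covers syntax only and says nothing about whether a NAAN is actually registered.

```python
# Betanumeric repertoire as listed in Section 2.3: digits plus
# consonants, minus the letter 'l'.
BETANUMERIC = frozenset("0123456789bcdfghjkmnpqrstvwxz")


def is_betanumeric_naan(naan: str) -> bool:
    """Return True if naan is one or more betanumeric characters.

    Hypothetical helper: it tests the character repertoire only and
    does not consult any NAAN registry.
    """
    return len(naan) > 0 and all(ch in BETANUMERIC for ch in naan)
```

For example, "12345" and "b2345" pass this check, while "a2345" (vowel), "l2345" (letter 'l'), and "12O45" (upper case letter) do not.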
569 Names that look more or less like numbers avoid common problems that 570 defeat persistence and international acceptance. The use of digits 571 is highly recommended. Mixing in non-vowel alphabetic characters 572 (eg, betanumerics) a couple at a time is a relatively safe and easy 573 way to achieve a denser namespace (more possible names for a given 574 length of the name string). Such names have a chance of aging and 575 traveling well. The absence of recognizable words makes typos harder 576 to detect in opaque strings, so a common mitigation is to add a check 577 character. Tools exist that mint, bind, and resolve opaque 578 identifiers, with or without check characters [NOID]. More on naming 579 considerations is given in a subsequent section. 581 2.4.1. Optional: Shoulder and Blade 583 Just as the ARK namespace is subdivided by NAANs reserved for NAAs, 584 each NAAN is a namespace that can be subdivided into "shoulders", 585 where each shoulder is reserved for an internal department or unit. 586 Like the NAAN, which is a string of characters that follows the 587 "ark:" label, a shoulder is a string of characters (starting with a 588 "/") that extends the NAAN. The base object name assigned by the NAA 589 consists of the NAAN, the shoulder, and a final string known as the 590 "blade". (The shoulder plus blade terminology mirrors locksmith 591 jargon describing the information-bearing parts of a key.) 593 The blade string is chosen by the NAA such that the string created by 594 concatenating the NAAN plus shoulder plus blade becomes the unique 595 base object name. Otherwise the blade may come from any source; for 596 example, it might come from a counter, a timestamp, a [NOID] minter, 597 a legacy 100-year-old accession number, etc. If there is a check 598 digit, it is expected to appear at the end of the blade and to be 599 computed over the base object name, which is generally the most 600 important part of an ARK to make opaque.
In particular, check digits 601 are not expected to cover qualifiers, which often name subobjects of 602 a persistent object that are less stable and less opaquely named than 603 the parent object (for example, ten years hence, the object's 604 thumbnail image will be of a higher resolution and the OCR text file 605 will be re-derived with improved algorithms). 607 It is important not to use any delimiter between the shoulder string 608 and blade string, especially not a "/" since it declares an object 609 boundary (see the section on ARKs that reveal object hierarchy). 610 This little bit of discretion shields organizations from end users 611 making inferences about expected levels of support based on 612 recognizable shoulders. To help in-house ARK administrators reliably 613 know where the shoulder ends, it is recommended to use the "first- 614 digit convention" so that shoulders are "primordinal". A primordinal 615 shoulder is a sequence of one or more betanumeric characters ending 616 in a digit. This means that the shoulder is all consonant letters 617 (often just one) after the NAAN and "/" up to and including the first 618 digit encountered after the NAAN. One property of primordinal 619 shoulders is that there is an infinite number of them possible under 620 any NAAN. 622 To help manage each namespace into the future, NAAs are encouraged to 623 create shoulders, even if there is only one to start with. There 624 are four NAANs (99999, 12345, 99152, 99166, XXX describe these) that 625 are shared across organizations. To create a shoulder on one of 626 them requires a registration process (XXX). 628 2.5. The Qualifier Part 630 The part of the ARK following the NAA-assigned Name is an optional 631 Qualifier. It is a string that extends the base ARK in order to 632 create a kind of service entry point into the object named by the 633 NAA.
At the discretion of the providing NMA, such a service entry 634 point permits an ARK to support access to individual hierarchical 635 components and subcomponents of an object, and to variants (versions, 636 languages, formats) of components. A Qualifier may be invented by 637 the NAA or by any NMA servicing the object. 639 In form, the Qualifier is a ComponentPath, or a VariantPath, or a 640 ComponentPath followed by a VariantPath. A VariantPath is introduced 641 and subdivided by the reserved character `.', and a ComponentPath is 642 introduced and subdivided by the reserved character `/'. In this 643 example, 645 https://example.org/ark:12345/x54xz321/s3/f8.05v.tiff 647 the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is 648 a VariantPath. The ARK Qualifier is a formalization of some 649 currently mainstream URL syntax conventions. This formalization 650 specifically reserves meanings that permit recipients to make strong 651 inferences about logical sub-object containment and equivalence based 652 only on the form of the received identifiers; there is great 653 efficiency in not having to inspect metadata records to discover such 654 relationships. NMAs are free not to disclose any of these 655 relationships merely by avoiding the reserved characters above. 656 Hierarchical components and variants are discussed further in the 657 next two sections. 659 The Qualifier, if present, differs from the Name in several important 660 respects. First, a Qualifier may have been assigned either by the 661 NAA or later by the NMA. The assignment of a Qualifier by an NMA 662 effectively amounts to an act of publishing a service entry point 663 within the conceptual object originally named by the NAA. For our 664 purposes, an ARK extended with a Qualifier assigned by an NMA will be 665 called an NMA-qualified ARK. 
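The division of a Qualifier into a ComponentPath and a VariantPath can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the regular expression are hypothetical, the Name is taken to end at the first `/' or `.' after the NAAN, and no normalization is attempted.

```python
import re

# Hypothetical pattern: label (new or old form), NAAN, Name, and
# whatever follows the Name (the Qualifier, possibly empty).
_ARK_RE = re.compile(r"ark:/?([^/]+)/([^/.]+)(.*)$", re.IGNORECASE)


def split_qualifier(ark: str) -> tuple[str, str, str, str]:
    """Return (naan, name, component_path, variant_path).

    The ComponentPath runs from the end of the Name up to the first
    period; the VariantPath is everything from that period onward.
    Either path may be the empty string.
    """
    m = _ARK_RE.search(ark)
    if m is None:
        raise ValueError("no ARK label found")
    naan, name, qualifier = m.groups()
    dot = qualifier.find(".")
    if dot == -1:
        return naan, name, qualifier, ""
    return naan, name, qualifier[:dot], qualifier[dot:]
```

Applied to the example above, https://example.org/ark:12345/x54xz321/s3/f8.05v.tiff, the sketch yields the ComponentPath "/s3/f8" and the VariantPath ".05v.tiff".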
667 Second, a Qualifier assignment on the part of an NMA is made in 668 fulfillment of its service obligations and may reflect changing 669 service expectations and technology requirements. NMA-qualified ARKs 670 could therefore be transient, even if the base, unqualified ARK is 671 persistent. For example, it would be reasonable for an NMA to 672 support access to an image object through an actionable ARK that is 673 considered persistent even if the experience of that access changes 674 as linking, labeling, and presentation conventions evolve and as 675 format and security standards are updated. For an image "thumbnail", 676 that NMA could also support an NMA-qualified ARK that is considered 677 impersistent because the thumbnail will be replaced with higher 678 resolution images as network bandwidth and CPU speeds increase. At 679 the same time, for an originally scanned, high-resolution master, the 680 NMA could publish an NMA-qualified ARK that is itself considered 681 persistent. Of course, the NMA must be able to return its separate 682 commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs, 683 and to any NAA-qualified ARKs that it supports. 685 A third difference between a Qualifier and a Name concerns the 686 semantic opaqueness constraint. When an NMA-qualified ARK is to be 687 used as a transient service entry point into a persistent object, the 688 priority given to semantic opaqueness observed by the NAA in the Name 689 part may be relaxed by the NMA in the Qualifier part. If service 690 priorities in the Qualifier take precedence over persistence, short- 691 term usability considerations may recommend somewhat semantically 692 laden Qualifier strings. 694 Finally, not only is the set of Qualifiers supported by an NMA 695 mutable, but different NMAs may support different Qualifier sets for 696 the same NAA-identified object. In this regard the NMAs act 697 independently of each other and of the NAA.
699 The next two sections describe how ARK syntax may be used to declare, 700 or to avoid declaring, certain kinds of relatedness among qualified 701 ARKs. 703 2.5.1. ARKs that Reveal Object Hierarchy 705 An NAA or NMA may choose to reveal the presence of a hierarchical 706 relationship between objects using the `/' (slash) character after 707 the Name part of an ARK. Some authorities will choose not to 708 disclose this information, while others will go ahead and disclose so 709 that manipulators of large sets of ARKs can infer object 710 relationships by simple identifier inspection; for example, this 711 makes it possible for a system to present a collapsed view of a large 712 search result set. 714 If the ARK contains an internal slash after the NAAN, the piece to 715 its left indicates a containing object. For example, publishing an 716 ARK of the form, 717 ark:12345/x54/xz/321 719 is equivalent to publishing three ARKs, 721 ark:12345/x54/xz/321 722 ark:12345/x54/xz 723 ark:12345/x54 725 together with a declaration that the first object is contained in the 726 second object, and that the second object is contained in the third. 728 Revealing the presence of hierarchy is completely up to the assigner 729 (NMA or NAA). It is hard enough to commit to one object's name, let 730 alone to three objects' names and to a specific, ongoing relatedness 731 among them. Thus, regardless of whether hierarchy was present 732 initially, the assigner, by not using slashes, reveals no shared 733 inferences about hierarchical or other inter-relatedness in the 734 following ARKs: 736 ark:12345/x54_xz_321 737 ark:12345/x54_xz 738 ark:12345/x54xz321 739 ark:12345/x54xz 740 ark:12345/x54 742 Note that slashes around the ARK's NAAN (/12345/ in these examples) 743 are not part of the ARK's Name and therefore do not indicate the 744 existence of some sort of NAAN super object containing all objects in 745 its namespace. 
A slash must have at least one non-structural 746 character (one that is neither a slash nor a period) on both sides in 747 order for it to separate recognizable structural components. So 748 initial or final slashes may be removed, and double slashes may be 749 converted into single slashes. 751 2.5.2. ARKs that Reveal Object Variants 753 An NAA or NMA may choose to reveal the possible presence of variant 754 objects or object components using the `.' (period) character after 755 the Name part of an ARK. Some authorities will choose not to 756 disclose this information, while others will go ahead and disclose so 757 that manipulators of large sets of ARKs can infer object 758 relationships by simple identifier inspection; for example, this 759 makes it possible for a system to present a collapsed view of a large 760 search result set. 762 If the ARK contains an internal period after Name, the piece to its 763 left is a root name and the piece to its right, and up to the end of 764 the ARK or to the next period is a suffix. A Name may have more than 765 one suffix, for example, 767 ark:12345/x54.24 768 ark:12345/x4z/x54.24 769 ark:12345/x54.20v.78g.f55 771 There are two main rules. First, if two ARKs share the same root 772 name but have different suffixes, the corresponding objects were 773 considered variants of each other (different formats, languages, 774 versions, etc.) by the assigner (NMA or NAA). Thus, the following 775 ARKs are variants of each other: 777 ark:12345/x54.20v.78g.f55 778 ark:12345/x54.321xz 779 ark:12345/x54.44 781 Second, publishing an ARK with a suffix implies the existence of at 782 least one variant identified by the ARK without its suffix. The ARK 783 otherwise permits no further assumptions about what variants might 784 exist. 
So publishing the ARK, 786 ark:12345/x54.20v.78g.f55 788 is equivalent to publishing the four ARKs, 790 ark:12345/x54.20v.78g.f55 791 ark:12345/x54.20v.78g 792 ark:12345/x54.20v 793 ark:12345/x54 795 Revealing the possibility of variants is completely up to the 796 assigner. It is hard enough to commit to one object's name, let 797 alone to multiple variants' names and to a specific, ongoing 798 relatedness among them. The assigner is the sole arbiter of what 799 constitutes a variant within its namespace, and whether to reveal 800 that kind of relatedness by using periods within its names. 802 A period must have at least one non-structural character (one that is 803 neither a slash nor a period) on both sides in order for it to 804 separate recognizable structural components. So initial or final 805 periods may be removed, and adjacent periods may be converted into a 806 single period. Multiple suffixes should be arranged in sorted order 807 (pure ASCII collating sequence) at the end of an ARK. 809 2.6. Character Repertoires 811 The Name and Qualifier parts are strings of visible ASCII characters. 812 For received ARKs, implementations must support a minimum length of 813 255 octets for the string composed of the Base ARK plus Qualifier. 814 Implementations generating strings exceeding this length should 815 understand that receiving implementations may not be able to index 816 such ARKs properly. Characters may be letters, digits, or any of 817 these seven characters: 819 = ~ * + @ _ $ 821 The following characters may also be used, but their meanings are 822 reserved: 824 % - . / 826 The characters `/' and `.' are ignored if either appears as the last 827 character of an ARK. If used internally, they allow a name assigner 828 to reveal object hierarchy and object variants as previously 829 described. 831 Hyphens are considered to be insignificant and are always ignored in 832 ARKs. 
A `-' (hyphen) may appear in an ARK for readability, or it may 833 have crept in during the formatting and wrapping of text, but it must 834 be ignored in lexical comparisons. As in a telephone number, hyphens 835 have no meaning in an ARK. It is always safe for an NMA that 836 receives an ARK to remove any hyphens found in it. As a result, like 837 the NMA, hyphens are "identity inert" in comparing ARKs for 838 equivalence. For example, the following ARKs are equivalent for 839 purposes of comparison and ARK service access: 841 ark:12345/x5-4-xz-321 842 https://sneezy.dopey.com/ark:12345/x54--xz32-1 843 ark:12345/x54xz321 845 The `%' character is reserved for %-encoding all other octets that 846 would appear in the ARK string, in the same manner as for URIs 847 [RFC3986]. A %-encoded octet consists of a `%' followed by two hex 848 digits; for example, "%7d" stands in for `}'. Lower case hex digits 849 are preferred to reduce the chances of false acronym recognition; 850 thus it is better to use "%acT" instead of "%ACT". The character `%' 851 itself must be represented using "%25". As with URNs, %-encoding 852 permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) 853 that have less restricted character repertoires [RFC2288]. 855 2.7. Normalization and Lexical Equivalence 857 To determine if two or more ARKs identify the same object, the ARKs 858 are compared for lexical equivalence after first being normalized. 859 Since ARK strings may appear in various forms (e.g., having different 860 NMAs), normalizing them minimizes the chances that comparing two ARK 861 strings for equality will fail unless they actually identify 862 different objects. In a specified-host ARK (one having an NMA), the 863 NMA never participates in such comparisons. Normalization described 864 here serves to define lexical equivalence but does not restrict how 865 implementors normalize ARKs locally for storage. 
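The normalization procedure enumerated in this section can be partially sketched in Python. The sketch is an assumption-laden illustration, not a reference implementation: the function name is hypothetical, it covers only steps 1 through 5 and step 7, it drops everything before the label rather than parsing the NMA, and it omits inflection handling (step 6) and the relocation of variant suffixes (step 8).

```python
import re


def normalize_ark(s: str) -> str:
    """Partially normalize an ARK for lexical comparison."""
    # Steps 1 and 3: drop anything before the label (including an NMA)
    # and reduce either label form, in any letter case, to "ark:".
    m = re.search(r"ark:/?", s, re.IGNORECASE)
    if m is None:
        raise ValueError("no ARK label found")
    rest = s[m.end():]
    # Step 2: remove any URI query string.
    rest = rest.split("?", 1)[0]
    # Step 4: lower-case the two characters after each '%'; the case
    # of all other letters is preserved.
    rest = re.sub(r"%(..)", lambda mm: "%" + mm.group(1).lower(), rest)
    # Step 5: remove all hyphens.
    rest = rest.replace("-", "")
    # Step 7: collapse each run of structural characters to its first
    # character, then remove initial and final occurrences.
    rest = re.sub(r"([/.])[/.]+", r"\1", rest)
    rest = rest.strip("/.")
    return "ark:" + rest
```

Under this sketch, the three equivalent ARKs shown in Section 2.6 (ark:12345/x5-4-xz-321, https://sneezy.dopey.com/ark:12345/x54--xz32-1, and ark:12345/x54xz321) all normalize to the same string, so an octet-by-octet comparison of the results succeeds.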
867 Normalization of a received ARK for the purpose of octet-by-octet 868 equality comparison with another ARK consists of the following steps. 870 1. The NMA part (eg, everything from an initial "https://" up to the 871 next slash), if present, is removed. 873 2. Any URI query string is removed (everything from the first 874 literal '?' to the end of the string). 876 3. The first case-insensitive match on "ark:/" or "ark:" is 877 converted to "ark:" (replacing any upper case letters and 878 removing any terminal '/'). 880 4. In the string that remains, the two characters following every 881 occurrence of `%' are converted to lower case. The case of all 882 other letters in the ARK string must be preserved. 884 5. All hyphens are removed. 886 6. If normalization is being done as part of a resolution step, and 887 if the end of the remaining string matches a known inflection, 888 the inflection is noted and removed. 890 7. Structural characters (slash and period) are normalized: initial 891 and final occurrences are removed, and two structural characters 892 in a row (e.g., // or ./) are replaced by the first character, 893 iterating until each occurrence has at least one non-structural 894 character on either side. 896 8. If there are any components with a period on the left and a slash 897 on the right, either the component and the preceding period must 898 be moved to the end of the Name part or the ARK must be thrown 899 out as malformed. 901 The resulting ARK string is now normalized. Comparisons between 902 normalized ARKs are case-sensitive, meaning that upper case letters 903 are considered different from their lower case counterparts. 905 To keep ARK string variation to a minimum, no reserved ARK characters 906 should be %-encoded unless it is deliberately to conceal their 907 reserved meanings. No non-reserved ARK characters should ever be 908 %-encoded. Finally, no %-encoded character should ever appear in an 909 ARK in its decoded form. 911 3.
Naming Considerations 913 The most important threats faced by persistence providers include 914 such things as funding loss, natural disaster, political and social 915 upheaval, processing faults, and errors in human oversight. There is 916 nothing that an identifier scheme can do about such things. Still, a 917 few observed identifier failures and inconveniences can be traced 918 back to naming practices that we now know to be less than optimal for 919 persistence. 921 3.1. ARKS Embedded in Language 923 The ARK has different goals from the URI, so it has different 924 character set requirements. Because linguistic constructs imperil 925 persistence, for ARKs non-ASCII character support is unimportant. 926 ARKs and URIs share goals of transcribability and transportability 927 within web documents, so characters are required to be visible, non- 928 conflicting with HTML/XML syntax, and not subject to tampering during 929 transmission across common transport gateways. Add the goal of 930 making an undelimited ARK recognizable in running prose, as in 931 ark:12345/=@_22*$, and certain punctuation characters (e.g., comma, 932 period) end up being excluded from the ARK lest the end of a phrase 933 or sentence be mistaken for part of the ARK. 935 This consideration has more direct effect on ARK usability in a 936 natural language context than it has on ARK persistence. The same is 937 true of the rule preventing hyphens from having lexical significance. 938 It is fine to publish ARKs with hyphens in them (e.g., the 939 output of UUID/GUID generators), but the uniform treatment of hyphens 940 as insignificant reduces the possibility of users transcribing 941 identifiers that will have been broken through unpredictable 942 hyphenation by word processors. Any measure that reduces user 943 irritation with an identifier will increase its chances of survival. 945 3.2.
Objects Should Wear Their Identifiers 947 A valuable technique for provision of persistent objects is to try to 948 arrange for the complete identifier to appear on, with, or near its 949 retrieved object. An object encountered at a moment in time when its 950 discovery context has long since disappeared could then easily be 951 traced back to its metadata, to alternate versions, to updates, etc. 952 This has seen reasonable success, for example, in book publishing and 953 software distribution. An identifier string only has meaning when 954 its association is known, and this is a very sure, simple, and low-tech 955 method of reminding everyone exactly what that association is. 957 3.3. Names are Political, not Technological 959 If persistence is the goal, a deliberate local strategy for 960 systematic name assignment is crucial. Names must be chosen with 961 great care. Poorly chosen and managed names will devastate any 962 persistence strategy, and they do not discriminate by identifier 963 scheme. Whether a mistakenly re-assigned name is a URN, DOI, PURL, 964 URL, or ARK, the damage -- failed access and confusion -- is not 965 mitigated more in one scheme than in another. Conversely, in-house 966 efforts to manage names responsibly will go much further towards 967 safeguarding persistence than any choice of naming scheme or name 968 resolution technology. 970 Branding (e.g., at the corporate or departmental level) is important 971 for funding and visibility, but substrings representing brands and 972 organizational names should be given a wide berth except when 973 absolutely necessary in the hostname (the identity-inert) part of the 974 ARK. These substrings are not only unstable because organizations 975 change frequently, but they are also dangerous because successor 976 organizations often have political or legal reasons to actively 977 suppress predecessor names and brands.
Any measure that reduces the 978 chances of future political or legal pressure on an identifier will 979 decrease the chances that our descendants will be obliged to 980 deliberately break it. 982 3.4. Choosing a Hostname or NMA 984 Hostnames appearing in any identifier meant to be persistent must be 985 chosen with extra care. The tendency in hostname selection has 986 traditionally been to choose a token with recognizable attributes, 987 such as a corporate brand, but that tendency wreaks havoc with 988 persistence that is supposed to outlive brands, corporations, subject 989 classifications, and natural language semantics (e.g., what did the 990 three letters "gay" mean in 1958, 1978, and 1998?). Today's 991 recognized and correct attributes are tomorrow's stale or incorrect 992 attributes. In making hostnames (any names, actually) long-term 993 persistent, it helps to eliminate recognizable attributes to the 994 extent possible. This affects selection of any name based on URLs, 995 including PURLs and the explicitly disposable NMAs. 997 There is no excuse for a provider that manages its internal names 998 impeccably not to exercise the same care in choosing what could be an 999 exceptionally durable hostname, especially if it would form the 1000 prefix for all the provider's URL-based external names. Registering 1001 an opaque hostname in the ".org" or ".net" domain would not be a bad 1002 start. Another way is to publish your ARKs with an organizational 1003 domain name that will be mapped by DNS to an appropriate NMA host. 1004 This makes for shorter names with less branding vulnerability. 1006 It is a mistake to think that hostnames are inherently unstable. If 1007 you require brand visibility, that may be a fact of life. But things 1008 are easier if yours is the brand of a long-lived cultural memory 1009 institution such as a national or university library or archive.
1010 Well-chosen hostnames from organizations that are sheltered from the 1011 direct effects of a volatile marketplace can easily provide longer- 1012 lived global resolvers than the domain names explicitly or implicitly 1013 used as starting points for global resolution by indirection-based 1014 persistent identifier schemes. For example, it is hard to imagine 1015 circumstances under which the Library of Congress' domain name would 1016 disappear sooner than, say, "handle.net". 1018 For smaller libraries, archives, and preservation organizations, 1019 there is a natural concern about whether they will be able to keep 1020 their web servers and domain names in the face of uncertain funding. 1021 One option is to form or join a consortium [N2T] of like-minded 1022 organizations with the purpose of providing mutual preservation 1023 support. The first goal of such a consortium would be to perpetually 1024 rent a hostname on which to establish a web server that simply 1025 redirects incoming member organization requests to the appropriate 1026 member server; using ARKs, for example, a 150-member consortium could 1027 run a very small server (24x7) that contained nothing more than 150 1028 rewrite rules in its configuration file. Even more helpful would be 1029 additional consortial support for a member organization that was 1030 unable to continue providing services and needed to find a successor 1031 archival organization. This would be a low-cost, low-tech way to 1032 publish ARKs (or URLs) under highly persistent hostnames. 1034 There are no obvious reasons why the organizations registering DNS 1035 names, URN Namespaces, and DOI publisher IDs should have among them 1036 one that is intrinsically more fallible than the next. Moreover, it 1037 is a misconception that the demise of DNS and of HTTP need adversely 1038 affect the persistence of URLs. 
At such a time, certainly URLs from 1039 the present day might not then be actionable by our present-day 1040 mechanisms, but resolution systems for future non-actionable URLs are 1041 no harder to imagine than resolution systems for present-day non- 1042 actionable URNs and DOIs. There is no more stable a namespace than 1043 one that is dead and frozen, and that would then characterize the 1044 space of names bearing the "http://" or "https://" prefix. It is 1045 useful to remember that just because hostnames have been carelessly 1046 chosen in their brief history does not mean that they are unsuitable 1047 in NMAs (and URLs) intended for use in situations demanding the 1048 highest level of persistence available in the Internet environment. 1049 A well-planned name assignment strategy is everything. 1051 3.5. Assigners of ARKs 1053 A Name Assigning Authority (NAA) is an organization that creates (or 1054 delegates creation of) long-term associations between identifiers and 1055 information objects. Examples of NAAs include national libraries, 1056 national archives, and publishers. An NAA may arrange with an 1057 external organization for identifier assignment. The US Library of 1058 Congress, for example, allows OCLC (the Online Computer Library 1059 Center, a major world cataloger of books) to create associations 1060 between Library of Congress call numbers (LCCNs) and the books that 1061 OCLC processes. A cataloging record is generated that testifies to 1062 each association, and the identifier is included by the publisher, 1063 for example, in the front matter of a book. 1065 An NAA does not so much create an identifier as create an 1066 association. The NAA first draws an unused identifier string from 1067 its namespace, which is the set of all identifiers under its control. 1068 It then records the assignment of the identifier to an information 1069 object having sundry witnessed characteristics, such as a particular 1070 author and modification date. 
A namespace is usually reserved for an 1071 NAA by agreement with recognized community organizations (such as 1072 IANA and ISO) that all names containing a particular string be under 1073 its control. In the ARK an NAA is represented by the Name Assigning 1074 Authority Number (NAAN). 1076 The ARK namespace reserved for an NAA is the set of names bearing its 1077 particular NAAN. For example, all strings beginning with 1078 "ark:12345/" are under control of the NAA registered under 12345, 1079 which might be the National Library of Finland. Because each NAA has 1080 a different NAAN, names from one namespace cannot conflict with those 1081 from another. Each NAA is free to assign names from its namespace 1082 (or delegate assignment) according to its own policies. These 1083 policies must be documented in a manner similar to the declarations 1084 required for URN Namespace registration [RFC2611]. 1086 Organizations can request or update a NAAN by filling out a form 1087 [NAANrequest]. 1089 3.6. NAAN Namespace Management 1091 Every NAA must have a namespace management strategy. A time-honored 1092 technique is to hierarchically partition a namespace into 1093 subnamespaces using prefixes that guarantee non-collision of names in 1094 different partitions. This practice is strongly encouraged for all 1095 NAAs, especially when subnamespace management will be delegated to 1096 other departments, units, or projects within an organization. For 1097 example, with a NAAN that is assigned to a university and managed by 1098 its main library, care should be taken to reserve semantically opaque 1099 prefixes that will set aside large parts of the unused namespace for 1100 future assignments. Prefix-based partition management is an 1101 important responsibility of the NAA. 1103 This sort of delegation by prefix is well-used in the formation of 1104 DNS names and ISBN identifiers.
An important difference is that in 1105 the former, the hierarchy is deliberately exposed and in the latter 1106 it is hidden. Rather than using lexical boundary markers such as the 1107 period (`.') found in domain names, the ISBN uses a publisher prefix 1108 but doesn't disclose where the prefix ends and the publisher's 1109 assigned name begins. This practice of non-disclosure, borrowed from 1110 the ISBN and ISSN schemes, is encouraged in assigning ARKs, because 1111 it reduces the visibility of an assertion that is probably not 1112 important now and may become a vulnerability later. 1114 Reasonable prefixes for assigned names usually consist of consonants 1115 and digits and are 1-5 characters in length. For example, the 1116 constant prefix "x9t" might be delegated to a book digitization 1117 project that creates identifiers such as 1119 https://444.berkeley.edu/ark:28722/x9t38rk45c 1121 If longevity is the goal, it is important to keep the prefixes free 1122 of recognizable semantics; for example, using an acronym representing 1123 a project or a department is discouraged. At the same time, you may 1124 wish to set aside a subnamespace for testing purposes under a prefix 1125 such as "fk..." that can serve as a visual clue and reminder to 1126 maintenance staff that this "fake" identifier was never published. 1128 There are other measures one can take to avoid user confusion, 1129 transcription errors, and the appearance of accidental semantics when 1130 creating identifiers. If you are generating identifiers 1131 automatically, pure numeric identifiers are likely to be 1132 semantically opaque enough, but it's probably useful to avoid leading 1133 zeroes because some users mistakenly treat them as optional, thinking 1134 (arithmetically) that they don't contribute to the "value" of the 1135 identifier.
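The prefix and numbering advice above can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the function, the choice to restrict prefixes to the betanumeric repertoire of Section 2.3, and the use of a simple positive integer counter as the source of the assigned part. Rendering the counter with str() guarantees there are no leading zeroes.

```python
import re

# Assumption: prefixes are 1-5 betanumeric characters (digits and
# consonants, excluding vowels and the letter 'l').
_PREFIX_RE = re.compile(r"[0-9bcdfghjkmnpqrstvwxz]{1,5}")


def make_numeric_ark(naan: str, prefix: str, serial: int) -> str:
    """Build an ARK from a NAAN, an opaque prefix, and a counter value.

    No delimiter is placed between the prefix and the number, so the
    boundary between them is not disclosed in the identifier.
    """
    if _PREFIX_RE.fullmatch(prefix) is None:
        raise ValueError("prefix must be 1-5 betanumeric characters")
    if serial < 1:
        raise ValueError("serial must be a positive integer")
    return f"ark:{naan}/{prefix}{serial}"
```

A call such as make_numeric_ark("28722", "x9t", 38) produces "ark:28722/x9t38", while a prefix containing a vowel or an acronym longer than five characters is rejected.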

   If you need lots of identifiers and you don't want them to get too
   long, you can mix digits with consonants (but avoid vowels since they
   might accidentally spell words) to get more identifiers without
   increasing the string length. In this case you may not want more
   than two letters in a row, because limiting letter runs reduces the
   chance of generating acronyms. Generator tools such as [NOID]
   provide support for these sorts of identifiers, and can also add a
   computed check character as a guarantee against the most common
   transcription errors. If used, it is recommended that the check
   character be appended to the original Base Object Name string (i.e.,
   minus the check character), that original string having been the
   basis for computing the check character.

3.7.  Sub-Object Naming

   As mentioned previously, semantically opaque identifiers are very
   useful for long-term naming of abstract objects; however, it may be
   appropriate to extend these names with less opaque extensions that
   reference contemporary service entry points (sub-objects) in support
   of the object. Sub-object extensions beginning with a digit or
   underscore (`_') are reserved for the possibility of developing a
   future registry of canonical service points (e.g., numeric references
   to versions, formats, languages, etc.).

4.  Finding a Name Mapping Authority

   In order to derive an actionable identifier (these days, a URL) from
   an ARK, a hostname (or hostname plus port combination) for a working
   Name Mapping Authority (NMA) must be found. An NMA is a service that
   is able to respond to basic ARK service requests. Relying on
   registration and client-side discovery, NMAs make known which NAAs'
   identifiers they are willing to service.

   Upon encountering an ARK, a user (or client software) looks inside it
   for the optional NMA part (the host part of the NMA's ARK service).
   If it contains an NMA that is working, this NMA discovery step may be
   skipped; the NMA effectively uses the beginning of an ARK to cache
   the results of a prior mapping authority discovery process. If a new
   NMA needs to be found, the client looks inside the ARK again for the
   NAAN (Name Assigning Authority Number). Querying a global database,
   it then uses the NAAN to look up all current NMAs that service ARKs
   issued by the identified NAA.

   The global database is key, and ideally the lookup would be automatic
   and transparent to the user. For this, the most promising method is
   probably the Name-to-Thing (N2T) Resolver [N2T] at n2t.net. It is a
   proposed low-cost, highly reliable, consortially maintained NMA that
   simply exists to support actionable HTTP-based URLs for as long as
   HTTP is used. One of its big advantages over the two other discovery
   methods described in this document, and over the URN, Handle, DOI,
   and PURL methods, is that N2T addresses the namespace splitting
   problem. When objects maintained by one NMA are inherited by more
   than one successor NMA, until now one of those successors would be
   required to maintain forwarding tables on behalf of the other
   successors.

   There are two other ways to discover an NMA. One is described in
   a subsection below. The other, described in an appendix, is based
   on a simplification of the URN resolver discovery method, itself very
   similar in principle to the resolver discovery method used by Handles
   and DOIs. None of these methods does more than what can be done with
   a very small, consortially maintained web server such as [N2T].

   In the interests of long-term persistence, however, ARK mechanisms
   are first defined in high-level, protocol-independent terms so that
   mechanisms may evolve and be replaced over time without compromising
   fundamental service objectives.
   Either or both specific methods
   given here may eventually be supplanted by better methods since, by
   design, the ARK scheme does not depend on a particular method, but
   only on having some method to locate an active NMA.

   At the time of issuance, at least one NMA for an ARK should be
   prepared to service it. That NMA may or may not be administered by
   the Name Assigning Authority (NAA) that created it. Consider the
   following hypothetical example of providing long-term access to a
   cancer research journal. The publisher wishes to turn a profit and
   the National Library of Medicine wishes to preserve the scholarly
   record. An agreement might be struck whereby the publisher would act
   as the NAA and the national library would archive the journal issue
   when it appears, but without providing direct access for the first
   six months. During the first six months of peak commercial
   viability, the publisher would retain exclusive delivery rights and
   would charge access fees. Again, by agreement, both the library and
   the publisher would act as NMAs, but during that initial period the
   library would redirect requests for issues less than six months old
   to the publisher. At the end of the waiting period, the library
   would then begin servicing requests for issues older than six months
   by tapping directly into its own archives. Meanwhile, the publisher
   might routinely redirect incoming requests for older issues to the
   library. Long-term access is thereby preserved, and so is the
   commercial incentive to publish content.

   Although it will be common for an NAA also to run an NMA service, it
   is never a requirement. Over time NAAs and NMAs will come and go.
   One NMA will succeed another, and there might be many NMAs serving
   the same ARKs simultaneously (e.g., as mirrors or as competitors).
   There might also be asymmetric but coordinated NMAs as in the
   library-publisher example above.

4.1.  Looking Up NMAs in a Globally Accessible File

   This subsection describes a way to look up NMAs using a simple name
   authority table represented as a plain text file. For efficient
   access the file may be stored in a local filesystem, but it needs to
   be reloaded periodically to incorporate updates. It is not expected
   that the size of the file or frequency of update should impose an
   undue maintenance or searching burden any time soon, for even a
   primitive linear search of a file with ten thousand NAAs is a
   subsecond operation on modern server machines. The proposed file
   strategy is similar to the /etc/hosts file strategy that supported
   Internet host address lookup for a period of years before the advent
   of DNS.

   The name authority table file is updated on an ongoing basis and is
   available for copying over the internet from a number of mirror sites
   [NAANregistry]. The file contains comment lines (lines that begin
   with `#') explaining the format and giving the file's modification
   time, reloading address, and NAA registration instructions.

5.  Generic ARK Service Definition

   An ARK request's output is delivered information; examples include
   the object itself, a policy declaration (e.g., a promise of support),
   a descriptive metadata record, or an error message. The experience
   of object delivery is expected to be an evolving mix of information
   that reflects changing service expectations and technology
   requirements; contemporary examples include such things as an object
   summary and component links formatted for human consumption. ARK
   services must be couched in high-level, protocol-independent terms if
   persistence is to outlive today's networking infrastructural
   assumptions.
   The high-level ARK service definitions listed below are
   followed in the next section by a concrete method (one of many
   possible methods) for delivering these services with today's
   technology. Note that some services may be invoked in one operation,
   such as when an '?info' inflection returns both a description and a
   permanence declaration for an object.

5.1.  Generic ARK Access Service (access, location)

   Returns (a copy of) the object or a redirect to the same, although a
   sensible object proxy may be substituted. Examples of sensible
   substitutes include,

   o  a table of contents instead of a large complex document,

   o  a home page instead of an entire web site hierarchy,

   o  a rights clearance challenge before accessing protected data,

   o  directions for access to an offline object (e.g., a book),

   o  a description of an intangible object (a disease, an event), or

   o  an applet acting as "player" for a large multimedia object.

   May also return a discriminated list of alternate object locators.
   If access is denied, returns an explanation of the object's current
   (perhaps permanent) inaccessibility.

5.1.1.  Generic Policy Service (permanence, naming, etc.)

   Returns declarations of policy and support commitments for given
   ARKs. Declarations are returned in either a structured metadata
   format or a human readable text format; sometimes one format may
   serve both purposes. Policy subareas may be addressed in separate
   requests, but the following areas should be covered: object
   permanence, object naming, object fragment addressing, and
   operational service support.

   The permanence declaration for an object is a rating defined with
   respect to an identified permanence provider (guarantor), which will
   be the NMA. It may include the following aspects.

   (a)  "object availability" -- whether and how access to the object
        is supported (e.g., online 24x7, or offline only),

   (b)  "identifier validity" -- under what conditions the identifier
        will be or has been re-assigned,

   (c)  "content invariance" -- under what conditions the content of
        the object is subject to change, and

   (d)  "change history" -- access to corrections, migrations, and
        revisions, whether through links to the changed objects
        themselves or through a document summarizing the change history.

   A recent approach to persistence statements, conceived independently
   from ARKs, can be found at [PStatements], with ongoing work available
   at [ARKagency]. An older approach to a permanence rating framework
   is given in [NLMPerm], which identified the following "permanence
   levels":

   Not Guaranteed:  No commitment has been made to retain this
      resource. It could become unavailable at any time. Its
      identifier could be changed.

   Permanent: Dynamic Content:  A commitment has been made to keep
      this resource permanently available. Its identifier will always
      provide access to the resource. Its content could be revised or
      replaced.

   Permanent: Stable Content:  A commitment has been made to keep this
      resource permanently available. Its identifier will always
      provide access to the resource. Its content is subject only to
      minor corrections or additions.

   Permanent: Unchanging Content:  A commitment has been made to keep
      this resource permanently available. Its identifier will always
      provide access to the resource. Its content will not change.

   Naming policy for an object includes an historical description of the
   NAA's (and its successor NAA's) policies regarding differentiation of
   objects.
   Since it is the NMA that responds to requests for policy
   statements, it is useful for the NMA to be able to produce or
   summarize these historical NAA documents. Naming policy may include
   the following aspects.

   (i)  "similarity" -- (or "unity") the limit, defined by the NAA, to
        the level of dissimilarity beyond which two similar objects
        warrant separate identifiers but before which they share one
        single identifier, and

   (ii) "granularity" -- the limit, defined by the NAA, to the level
        of object subdivision beyond which sub-objects do not warrant
        separately assigned identifiers but before which sub-objects are
        assigned separate identifiers.

   Subnaming policy for an object describes the qualifiers that the NMA,
   in fulfilling its ongoing and evolving service obligations, allows as
   extensions to an NAA-assigned ARK. To the conceptual object that the
   NAA named with an ARK, the NMA may add component access points and
   derivatives (e.g., format migrations in aid of preservation) in order
   to provide both basic and value-added services.

   Addressing policy for an object includes a description of how, during
   access, object components (e.g., paragraphs, sections) or views
   (e.g., image conversions) may or may not be "addressed", in other
   words, how the NMA permits arguments or parameters to modify the
   object delivered as the result of an ARK request. If supported,
   these sorts of operations would provide things like byte-ranged
   fragment delivery and open-ended format conversions, or any set of
   possible transformations that would be too numerous to list or to
   identify with separately assigned ARKs.

   Operational service support policy includes a description of general
   operational aspects of the NMA service, such as after-hours staffing
   and trouble reporting procedures.

5.1.2.  Generic Description Service

   Returns a description of the object. Descriptions are returned in a
   structured metadata format, a human-readable text format, or in one
   format that serves both purposes (such as human-readable HTML with
   embedded machine-readable metadata, or perhaps YAML). A description
   must at a minimum answer the who, what, when, and where questions
   ("where" being the long-term identifier as opposed to a transient
   redirect target) concerning an expression of the object. Standalone
   descriptions should be accompanied by the modification date and
   source of the description itself. May also return discriminated
   lists of ARKs that are related to the given ARK.

5.2.  Overview of The HTTP URL Mapping Protocol (THUMP)

   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (any
   identifier) and asking such questions as, what information does this
   identify and how permanent is it? [THUMP] is in fact one specific
   method under development for delivering ARK services. The protocol
   runs over HTTP to exploit the web browser's current pre-eminence as
   user interface to the Internet. THUMP is designed so that a person
   can enter ARK requests directly into the location field of current
   browser interfaces. Because it runs over HTTP, THUMP can be
   simulated and tested via keyboard-based interactions [RFC0854].

   The asker (a person or client program) starts with an identifier,
   such as an ARK or a URL. The identifier reveals to the asker (or
   allows the asker to infer) the Internet host name and port number of
   a server system that responds to questions. Here, this is just the
   NMA that is obtained by inspection and possibly lookup based on the
   ARK's NAAN.
   The asker then sets up an HTTP session with the server
   system, sends a question via a THUMP request (contained within an
   HTTP request), receives an answer via a THUMP response (contained
   within an HTTP response), and closes the session. That concludes the
   connected portion of the protocol.

   A THUMP request is a string of characters beginning with a `?'
   (question mark) that is appended to the identifier string. The
   resulting string is sent as an argument to HTTP's GET command.
   Request strings too long for GET may be sent using HTTP's POST
   command. The two most common requests correspond to two degenerate
   special cases. First, a simple key with no request at all is the
   same as an ordinary access request. Thus a plain ARK entered into a
   browser's location field behaves much like a plain URL, and returns
   access to the primary identified object, for instance, an HTML
   document.

   The second special case is a minimal ARK description request string
   consisting of just "?info". For example, entering the string,

      n2t.net/ark:67531/metadc107835?info

   into the browser's location field directly precipitates a request for
   a metadata record describing the object identified by
   ark:67531/metadc107835. The browser, unaware of THUMP, prepares and
   sends an HTTP GET request in the same manner as for a URL. THUMP is
   designed so that the response (indicated by the returned HTTP content
   type) is normally displayed, whether the output is structured for
   machine processing (text/plain) or formatted for human consumption
   (text/html). In addition to '?info', this specification reserves
   both '?' and '??' (originally older forms) for future use.

   The following example THUMP session assumes metadata being returned
   by a resolver (as server) to a browser client.
   Each line has been
   annotated to include a line number and whether it was the client or
   server that sent it. Without going into much depth, the session has
   three pieces separated from each other by blank lines: the client's
   piece (lines 1-3), the server's HTTP/THUMP response headers (4-7),
   and the body of the server's response (8-18). The first and last
   lines (1 and 18) correspond to the client's steps to start the TCP
   session and the server's steps to end it, respectively.

     1  C: [opens session]
        C: GET https://n2t.net/ark:67531/metadc107835?info HTTP/1.1
        C:
        S: HTTP/1.1 200 OK
     5  S: Content-Type: text/plain
        S: THUMP-Status: 0.6 200 OK
        S:
        S: erc:
        S: who: Austin, Larry
    10  S: what: A Study of Rhythm in Bach's Orgelbuechlein
        S: when: 1952
        S: where: https://digital.library.unt.edu/ark:/67531/metadc107835
        S: erc-support:
        S: who: University of North Texas Libraries
    15  S: what: Permanent: Stable Content:
        S: when: 20081203
        S: where: https://digital.library.unt.edu/ark:/67531/
        S: [closes session]

   The first two server response lines (4-5) above are typical of HTTP.
   The next line (6) is peculiar to THUMP, and indicates the THUMP
   version and a normal return status.

   The balance of the response consists of a single metadata record
   (8-17) that comprises the ARK description service response. The
   returned record is in the format of an Electronic Resource Citation
   [ERC], which is discussed in overview in the next section.
   For now,
   note that it contains four elements that answer the top priority
   questions regarding an expression of the object: who played a major
   role in expressing it, what the expression was called, when it was
   created, and where the expression may be found (note that "where" is
   preferably a persistent, citable identifier rather than an unstable
   URL sometimes mistakenly referred to as a "location"). This quartet
   of elements comes up again and again in ERCs. Lines 13-17 contain a
   minimal persistence statement.

   Each segment in an ERC tells a different story relating to the
   object, so although the same four questions (elements) appear in
   each, the answers depend on the segment's story type. While the
   first segment tells the story of an expression of the object, the
   second segment tells the story of the support commitment made to it:
   who made the commitment, what the nature of the commitment was, when
   it was made, and where a fuller explanation of the commitment may be
   found.

5.3.  The Electronic Resource Citation (ERC)

   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
   kind of object description that uses Dublin Core Kernel metadata
   elements [DCKernel]. The ERC with Kernel elements provides a simple,
   compact, and printable record for holding data associated with an
   information resource. As originally designed [Kernel], Kernel
   metadata balances the needs for expressive power, very simple machine
   processing, and direct human manipulation.
   The ERC sense of
   "citation" is not limited to the traditional referencing of a result
   or information fixed in time on a printed page, but to a more general
   kind of reference, both backward, to digital material that cannot be
   known to be fixed in time (true of virtually all online information),
   and forward, to material that is all the more valuable for improving
   or evolving over time.

   The previous section shows two limited examples of what is fully
   described elsewhere [ERC]. The rest of this short section provides
   some of the background and rationale for this record format.

   A founding principle of Kernel metadata is that direct human contact
   with metadata will be a necessary and sufficient condition for the
   near term rapid development of metadata standards, systems, and
   services. Thus the machine-processable Kernel elements must only
   minimally strain people's ability to read, understand, change, and
   transmit ERCs without their relying on intermediation with
   specialized software tools. The basic ERC needs to be succinct,
   transparent, and trivially parseable by software.

   Borrowing from the data structuring format that underlies the
   successful spread of email and web services, the ERC format uses
   [ANVL], which is based on email and HTTP headers [RFC2822]. There is
   a naturalness to ANVL's label-colon-value format (seen in the
   previous section) that barely needs explanation to a person beginning
   to enter ERC metadata.

   While ANVL elements are expected at the top level and don't
   themselves support hierarchy, the value of an ANVL element may be an
   arbitrary encoded hierarchy of JSON or XML. Typically, the name of
   such an ANVL element ends in "json" or "xml", for example, "json" or
   "geojson".
   Care should be taken to escape structural characters that
   appear in element names and values, specifically, line terminators
   (both newlines ("\n") and carriage returns ("\r")) and, in element
   names, colons (":").

   Besides simplicity of ERC system implementation and data entry
   mechanics, ERC semantics (what the record and its constituent parts
   mean) must also be easy to explain. ERC semantics are based on a
   reformulation and extension of the Dublin Core [RFC5013] hypothesis,
   which suggests that the fifteen Dublin Core metadata elements have a
   key role to play in cross-domain resource description. The ERC
   design recognizes that the Dublin Core's primary contribution is the
   international, interdisciplinary consensus that identified fifteen
   semantic buckets (element categories), regardless of how they are
   labeled. The ERC then adds a definition for a record and some
   minimal compliance rules. In pursuing the limits of simplicity, the
   ERC design combines and relabels some Dublin Core buckets to isolate
   a tiny kernel (subset) of four elements for basic cross-domain
   resource description.

   For the cross-domain kernel, the ERC uses the four basic elements --
   who, what, when, and where -- to pretend that every object in the
   universe can have a uniform minimal description. Each has a name or
   other identifier, a locator (a means to access it), some responsible
   person or party, and a date. It doesn't matter what type of object
   it is, or whether one plans to read it, interact with it, smoke it,
   wear it, or navigate it. Of course, this approach is flawed because
   uniformity of description for some object types requires more
   semantic contortion and sacrifice than for others. That is why at
   the beginning of this document, the ARK was said to be suited to
   objects that accommodate reasonably regular electronic description.
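
   To make the "trivially parseable" claim concrete, here is a minimal
   sketch that reads a record shaped like the one in the THUMP session
   of Section 5.2 and groups its elements by segment. It is an
   illustration only, not an ANVL implementation: it assumes one
   element per line, no continuation lines or escaping, and it treats
   any label with an empty value (such as "erc:") as the start of a new
   segment. The names parse_erc, has_kernel, and KERNEL are
   hypothetical.

```python
# The four kernel elements named in the text.
KERNEL = ("who", "what", "when", "where")


def parse_erc(text: str) -> dict[str, dict[str, str]]:
    """Group 'label: value' lines into segments keyed by segment label.

    Simplifying assumptions (not taken from the ANVL specification):
    one element per line, and an empty value marks a new segment.
    """
    segments: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the FIRST colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a label-colon-value line: {line!r}")
        label, value = label.strip(), value.strip()
        if value == "":
            current = segments.setdefault(label, {})
        elif current is None:
            raise ValueError(f"element {label!r} appears before any segment")
        else:
            current[label] = value
    return segments


def has_kernel(segment: dict[str, str]) -> bool:
    """True if the segment answers who, what, when, and where."""
    return all(name in segment for name in KERNEL)
```

   Applied to lines 8-17 of the example session, this yields two
   segments, "erc" and "erc-support", each answering all four kernel
   questions; the "what" value "Permanent: Stable Content:" survives
   intact because only the first colon on a line is treated as the
   separator.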

   While insisting on uniformity at the most basic level provides
   powerful cross-domain leverage, the semantic sacrifice is great for
   many applications. So the ERC also permits a semantically rich and
   nuanced description to co-exist in a record along with a basic
   description. In that way both sophisticated and naive recipients of
   the record can extract the level of meaning from it that best suits
   their needs and abilities. Key to unlocking the richer description
   is a controlled vocabulary of ERC record types (not explained in this
   document) that permit knowledgeable recipients to apply defined sets
   of additional assumptions to the record.

5.4.  Advice to Web Clients

   ARKs are envisaged to appear wherever durable object references are
   planned. Library cataloging records, literature citations, and
   bibliographies are important examples. In many of these places URLs
   (Uniform Resource Locators) are currently used, and inside some of
   those URLs are embedded URNs, Handles, and DOIs. Unfortunately,
   there's no suggestion of a way to probe for extra services that would
   build confidence in those identifiers; in other words, there's no way
   to tell whether any of those identifiers is any better managed than
   the average URL.

   ARKs are also envisaged to appear in hypertext links (where they are
   not normally shown to users) and in rendered text (displayed or
   printed). A normal HTML link for which the URL is not displayed
   looks like this.

      Click Here

   A URL with an embedded ARK invites access (via `?info') to extra
   services:

      Click Here

   Using the [N2T] resolver to provide identifier-scheme-agnostic
   protection against hostname instability, this ARK could be published
   as:

      Click Here

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

5.5.  Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information. These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery. For
   example, recipients of ARKs with a specified host (NMA) should treat
   it like a URL and be aware that the identified ARK service may no
   longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As specializations of such protocols, an ARK service may limit
   exposure to the usual risks. Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

6.  References

   [ANVL]     Kunze, J., Kahle, B., Masanes, J., and G. Mohr, "A
              Name-Value Language", 2005.

   [ARK]      Kunze, J., "Towards Electronic Persistence Using ARK
              Identifiers", IWAW/ECDL Annual Workshop Proceedings 3rd,
              August 2003.

   [ARKagency]
              Alliance, A., "ARK Maintenance Agency", 2021.

   [DCKernel]
              Initiative, D. C. M., "Kernel Metadata Working Group",
              2001-2008.

   [DOI]      Foundation, I. D., "The Digital Object Identifier (DOI)
              System", February 2001.

   [ERC]      Kunze, J. and A. Turner, "Kernel Metadata and Electronic
              Resource Citations", October 2007.

   [Handle]   Lannom, L., "Handle System Overview", ICSTI Forum No. 30,
              April 1999.

   [Kernel]   Kunze, J., "A Metadata Kernel for Electronic Permanence",
              Journal of Digital Information Vol 2, Issue 2,
              ISSN 1368-7506, January 2002.

   [N2T]      Alliance, A., "Name-to-Thing Resolver", August 2006.

   [NAANregistry]
              ARKs.org, "NAAN Registry", 2019.

   [NAANrequest]
              ARKs.org, "NAAN Request Form", 2018.

   [NLMPerm]  Byrnes, M., "Permanence Levels and the Archives for NLM's
              Permanent Web Documents", March 2005.

   [NOID]     Kunze, J., "Nice Opaque Identifiers", April 2006.

   [PStatements]
              Kunze, J., "Persistence statements: describing digital
              stickiness", October 2016.

   [PURL]     Shafer, K., "Introduction to Persistent Uniform Resource
              Locators", 1996.

   [RFC0854]  Postel, J. and J. Reynolds, "Telnet Protocol
              Specification", STD 8, RFC 854, DOI 10.17487/RFC0854, May
              1983.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, DOI 10.17487/RFC2141,
              May 1997.

   [RFC2288]  Lynch, C., Preston, C., and R. Daniel, "Using Existing
              Bibliographic Identifiers as Uniform Resource Names",
              RFC 2288, DOI 10.17487/RFC2288, February 1998.

   [RFC2611]  Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom,
              "URN Namespace Definition Mechanisms", BCP 33, RFC 2611,
              DOI 10.17487/RFC2611, June 1999.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616,
              DOI 10.17487/RFC2616, June 1999.

   [RFC2822]  Resnick, P., Ed., "Internet Message Format", RFC 2822,
              DOI 10.17487/RFC2822, April 2001.

   [RFC2915]  Mealling, M. and R. Daniel, "The Naming Authority Pointer
              (NAPTR) DNS Resource Record", RFC 2915,
              DOI 10.17487/RFC2915, September 2000.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC5013]  Kunze, J. and T. Baker, "The Dublin Core Metadata Element
              Set", RFC 5013, DOI 10.17487/RFC5013, August 2007.

   [THUMP]    Gamiel, K. and J. Kunze, "The HTTP URL Mapping Protocol",
              August 2007.

Appendix A.  ARK Maintenance Agency: arks.org

   The ARK Maintenance Agency [ARKagency] at arks.org has several
   functions.

   o  To manage the registry of organizations that will be assigning
      ARKs. Organizations can request or update a NAAN by filling out a
      form [NAANrequest].

   o  To be a clearinghouse for information about ARKs, such as best
      practices, introductory documentation, tutorials, community
      forums, etc. These supplemental resources help ARK implementors
      in high-level applications across different sectors and
      disciplines, and with a variety of metadata standards.

   o  To be a locus of discussion about future versions of the ARK
      specification.

Appendix B.  Looking up NMAs Distributed via DNS

   This appendix introduces an older method for looking up NMAs that
   is based on the method for discovering URN resolvers described in
   [RFC2915]. It relies on querying the DNS system already installed in
   the background infrastructure of most networked computers.
   A query
   is submitted to DNS asking for a list of resolvers that match a given
   NAAN. DNS distributes the query to the particular DNS servers that
   can best provide the answer, unless the answer can be found more
   quickly in a local DNS cache as a side-effect of a recent query.
   Responses come back inside Name Authority Pointer (NAPTR) records.
   The normal result is one or more candidate NMAs.

   In its full generality the [RFC2915] algorithm ambitiously
   accommodates a complex set of preferences, orderings, protocols,
   mapping services, regular expression rewriting rules, and DNS record
   types. This appendix proposes a drastic simplification of it for
   the special case of ARK mapping authority discovery. The simplified
   algorithm is called Maptr. It uses only one DNS record type (NAPTR)
   and restricts most of its field values to constants. The following
   hypothetical excerpt from a DNS data file for the NAAN known as 12026
   shows three example NAPTR records ready to use with the Maptr
   algorithm.

      12026.ark.arpa.
      ;; US Library of Congress
      ;;       order pref flags service regexp replacement
      IN NAPTR 0     0    "h"   "ark"   "USLC" lhc.nlm.nih.gov:8080
      IN NAPTR 0     0    "h"   "ark"   "USLC" foobar.zaf.org
      IN NAPTR 0     0    "h"   "ark"   "USLC" sneezy.dopey.com

   All the fields are held constant for Maptr except for the "flags",
   "regexp", and "replacement" fields. The "service" field contains the
   constant value "ark" so that NAPTR records participating in the Maptr
   algorithm will not be confused with other NAPTR records. The "order"
   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
   the algorithm may evolve to use these fields for ranking decisions
   when usage patterns and local administrative needs are better
   understood.
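
   The record layout just described can be modeled with a small data
   structure. The sketch below is illustrative only and performs no DNS
   queries; the names NaptrRecord, maptr_query_name, and
   maptr_candidates are hypothetical. It shows how a client might form
   the query name from a NAAN and keep only the records whose service
   field is "ark".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaptrRecord:
    """The six NAPTR fields, in the order shown in the excerpt above."""
    order: int
    pref: int
    flags: str
    service: str
    regexp: str
    replacement: str


def maptr_query_name(naan: str) -> str:
    """Domain name to query for a given NAAN (for example, '12026')."""
    return f"{naan}.ark.arpa"


def maptr_candidates(records: list[NaptrRecord]) -> list[NaptrRecord]:
    """Keep only records that participate in Maptr (service is 'ark')."""
    return [r for r in records if r.service == "ark"]


# The three hypothetical records from the excerpt, plus one unrelated
# NAPTR record (an invented example) that the filter should discard.
EXAMPLE = [
    NaptrRecord(0, 0, "h", "ark", "USLC", "lhc.nlm.nih.gov:8080"),
    NaptrRecord(0, 0, "h", "ark", "USLC", "foobar.zaf.org"),
    NaptrRecord(0, 0, "h", "ark", "USLC", "sneezy.dopey.com"),
    NaptrRecord(10, 5, "u", "other+service", "", "example.invalid"),
]
```

   Here maptr_query_name("12026") gives "12026.ark.arpa", and
   maptr_candidates(EXAMPLE) returns the first three records.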

   When a Maptr query returns a record with a flags field of "h" (for
   host, a Maptr extension to the NAPTR flags), the replacement field
   contains the NMA (host) of an ARK service provider.  When a query
   returns a record with a flags field of "" (the empty string), the
   client needs to submit a new query containing the domain name found
   in the replacement field.  This second sort of record exploits the
   distributed nature of DNS by redirecting the query to another domain
   name.  Such a record looks like this.

      12345.ark.arpa.
      ;; Digital Library Consortium
      ;;       order pref flags service regexp replacement
      IN NAPTR 0     0    ""    "ark"   ""     dlc.spct.org.

   Here is the Maptr algorithm for ARK mapping authority discovery.  In
   it replace <NAAN> with the NAAN from the ARK for which an NMA is
   sought.

   1.  Initialize the DNS query: type=NAPTR, query=<NAAN>.ark.arpa.

   2.  Submit the query to DNS and retrieve (NAPTR) records, discarding
       any record that does not have "ark" for the service field.

   3.  All remaining records with a flags field of "h" contain
       candidate NMAs in their replacement fields.  Set them aside, if
       any.

   4.  Any record with an empty flags field ("") has a replacement field
       containing a new domain name to which a subsequent query should
       be redirected.  For each such record, set query=<replacement>,
       then go to step (2).  When all such records have been
       recursively exhausted, go to step (5).

   5.  All redirected queries have been resolved and a set of candidate
       NMAs has been accumulated from step (3).  If there are zero
       NMAs, exit; no mapping authority was found.  If there are one or
       more NMAs, choose one using any criteria you wish, then exit.

   A Perl script that implements this algorithm is included here.

   #!/usr/bin/perl

   use strict;
   use warnings;
   use Net::DNS;                          # include simple DNS package

   my $qtype = "NAPTR";                   # initialize query type
   my $naa = shift                        # get NAAN script argument
      or die "usage: $0 NAAN\n";
   my $mad = Net::DNS::Resolver->new;     # mapping authority discovery

   maptr("$naa.ark.arpa");                # call maptr - that's it

   sub maptr {                            # recursive maptr algorithm
      my $dname = shift;                  # domain name as argument
      my $query = $mad->query($dname, $qtype);
      return                              # non-productive query
         if (! $query || ! $query->answer);
      foreach my $rr ($query->answer) {
         next                             # skip records of wrong type
            if ($rr->type ne $qtype);
         next                             # skip non-ARK NAPTR records
            if (lc($rr->service) ne "ark");
         # read the fields through the NAPTR accessor methods
         my $flags = lc($rr->flags);
         my $replacement = $rr->replacement;
         if ($flags eq "") {
            maptr($replacement);          # recurse
         } elsif ($flags eq "h") {
            print "$replacement\n";       # candidate NMA
         }
      }
   }

   The global database distributed via DNS and queried with the Maptr
   algorithm mirrors the contents of the Name Authority Table file
   described earlier in this document.

Authors' Addresses

   John A. Kunze
   California Digital Library
   1111 Franklin Street
   Oakland, CA  94607
   USA

   Email: jak@ucop.edu


   Emmanuelle Bermes
   Bibliotheque nationale de France
   Quai Francois Mauriac
   Paris  75706
   France

   Email: emmanuelle.bermes@bnf.fr