Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Intended status: Informational                                 E. Bermès
Expires: 12 April 2022                  Bibliothèque nationale de France
                                                          9 October 2021

                       The ARK Identifier Scheme
                           draft-kunze-ark-29

Abstract

   The ARK (Archival Resource Key) naming scheme is designed to
   facilitate the high-quality and persistent identification of
   information objects. A founding principle of the ARK is that
   persistence is purely a matter of service and is neither inherent in
   an object nor conferred on it by a particular naming syntax. The
   best that an identifier can do is to lead users to the services that
   support robust reference.
   The term ARK itself refers both to the
   scheme and to any single identifier that conforms to it. An ARK has
   five components:

      [https://NMA/]ark:[/]NAAN/Name[Qualifiers]

   an optional and mutable Name Mapping Authority (usually a hostname),
   the "ark:" label, the Name Assigning Authority Number (NAAN), the
   assigned Name, and an optional and possibly mutable Qualifier
   supported by the NMA. The NAAN and Name together form the immutable
   persistent identifier for the object independent of the URL hostname.
   An ARK is a special kind of URL that connects users to three things:
   the named object, its metadata, and the provider's promise about its
   persistence. When entered into the location field of a Web browser,
   the ARK leads the user to the named object. That same ARK, inflected
   by appending `?info', returns a metadata record that is both human-
   and machine-readable. The returned record contains core metadata and
   a commitment statement from the current provider. Tools exist for
   minting, binding, and resolving ARKs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 12 April 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Reasons to Use ARKs
     1.2.  Three Requirements of ARKs
     1.3.  Organizing Support for ARKs: Our Stuff vs. Their Stuff
     1.4.  Definition of Identifier
   2.  ARK Anatomy
     2.1.  The Name Mapping Authority (NMA)
     2.2.  The ARK Label Part (ark:)
     2.3.  The Name Assigning Authority Number (NAAN)
     2.4.  The Name Part
       2.4.1.  Optional: Shoulders
     2.5.  The Qualifier Part
       2.5.1.  ARKs that Reveal Object Hierarchy
       2.5.2.  ARKs that Reveal Object Variants
     2.6.  Character Repertoires
     2.7.  Normalization and Lexical Equivalence
     2.8.  Resolver Chains
   3.  Naming Considerations
     3.1.  ARKS and Usability
     3.2.  Objects Should Wear Their Identifiers
     3.3.  Names are Political, not Technological
     3.4.  Choosing a Hostname or NMA
     3.5.  Assigners of ARKs
     3.6.  NAAN Namespace Management
     3.7.  Sub-Object Naming
   4.  Finding a Name Mapping Authority
     4.1.  Looking Up NMAs in a Globally Accessible File
   5.  Generic ARK Service Definition
     5.1.  Generic ARK Access Service (access, location)
       5.1.1.  Generic Policy Service (permanence, naming, etc.)
       5.1.2.  Generic Description Service
     5.2.  Overview of The HTTP URL Mapping Protocol (THUMP)
     5.3.  The Electronic Resource Citation (ERC)
     5.4.  Advice to Web Clients
     5.5.  Enhancements and Related Specifications
     5.6.  Security Considerations
   6.  References
   Appendix A.  ARK Maintenance Agency: arks.org
   Appendix B.  Looking up NMAs Distributed via DNS
   Authors' Addresses

1.  Introduction

   [ Note about this transitional draft. The ARK Alliance Technical
   Working Group
   (https://wiki.lyrasis.org/display/ARKs/Technical+Working+Group)
   is in the process of revising the ARK spec
   via a series of Internet-Drafts. This draft contains many minor but
   noisy changes (lots of diffs but not much real change). While the
   spec is in transition, new implementors should follow
   https://datatracker.ietf.org/doc/html/draft-kunze-ark-18. ]

   This document describes a scheme for the high-quality naming of
   information resources. The scheme, called the Archival Resource Key
   (ARK), is well suited to long-term access and identification of any
   information resources that accommodate reasonably regular electronic
   description.
   This includes digital documents, databases, software,
   and websites, as well as physical objects (books, bones, statues,
   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
   performances). Hereafter the term "object" refers to an information
   resource. The term ARK itself refers both to the scheme and to any
   single identifier that conforms to it. A reasonably concise and
   accessible overview and rationale for the scheme is available at
   [ARK].

   Schemes for persistent identification of network-accessible objects
   are not new. In the early 1990s, the design of the Uniform Resource
   Name [RFC2141] responded to the observed failure rate of URLs by
   articulating an indirect, non-hostname-based naming scheme and the
   need for responsible name management. Meanwhile, promoters of the
   Digital Object Identifier [DOI] succeeded in building a community of
   providers around a mature software system [Handle] that supports name
   management. The Persistent Uniform Resource Locator [PURL] was
   another scheme that had the advantage of working with unmodified web
   browsers. ARKs represent an approach that attempts to build on the
   strengths and to avoid the weaknesses of these schemes.

   A founding principle of the ARK is that persistence is purely a
   matter of service. Persistence is neither inherent in an object nor
   conferred on it by a particular naming syntax. Nor is the technique
   of name indirection -- upon which URNs, Handles, DOIs, and PURLs are
   founded -- of central importance. Name indirection is an ancient and
   well-understood practice; new mechanisms for it keep appearing and
   distracting practitioner attention, with the Domain Name System (DNS)
   [RFC1034] being a particularly dazzling and elegant example.
   What is often forgotten is that maintenance of an indirection
   table is an unavoidable cost to the organization providing
   persistence, and that
   cost is equivalent across naming schemes. That indirection has
   always been a native part of the web while being so lightly utilized
   for the persistence of web-based objects indicates how unsuited most
   organizations will probably be to the task of table maintenance and
   to the much more fundamental challenge of keeping the objects
   themselves viable.

   Persistence is achieved through a provider's successful stewardship
   of objects and their identifiers. The highest level of persistence
   will be reinforced by a provider's robust contingency, redundancy,
   and succession strategies. It is further safeguarded to the extent
   that a provider's mission is shielded from funding and political
   instabilities. These are by far the major challenges confronting
   persistence providers, and no identifier scheme has any direct impact
   on them. In fact, some schemes may actually be liabilities for
   persistence because they create short- and long-term dependencies for
   every object access on complex, special-purpose infrastructures,
   parts of which are proprietary and all of which increase the
   carry-forward burden for the preservation community. It is for this
   reason that the ARK scheme relies only on educated name assignment
   and light use of general-purpose infrastructures that are maintained
   mostly by the Internet community at large (the DNS, web servers, and
   web browsers).

1.1.  Reasons to Use ARKs

   If no persistent identifier scheme contributes directly to
   persistence, why not just use URLs? A particular URL may be as
   durable an identifier as it is possible to have, but nothing
   distinguishes it from an ordinary URL to the recipient who is
   wondering if it is suitable for long-term reference.
   An ARK embedded
   in a URL provides some of the necessary conditions for credible
   persistence, inviting access to not one, but to three things: to the
   object, to its metadata, and to a nuanced statement of commitment
   from the provider in question (the NMA, described below) regarding
   the object. Existence of the extra service can be probed
   automatically by appending `?info' to the ARK.

   The form of the ARK also supports the natural separation of naming
   authorities into the original name assigning authority and the
   diverse multiple name mapping (or servicing) authorities that in
   succession and in parallel will take over custodial responsibilities
   from the original assigner (assuming the assigner ever held that
   responsibility) for the large majority of a long-term object's
   archival lifetime. The name mapping authority, indicated by the
   hostname part of the URL that contains the ARK, serves to launch the
   ARK into cyberspace. Should it ever fail (and there is no reason why
   a well-chosen hostname for a 100-year-old cultural memory institution
   shouldn't last as long as the DNS), that hostname is considered
   disposable and replaceable. Again, the form of the ARK helps
   because it defines exactly how to recover the core immutable object
   identity, and simple algorithms (one based on the URN model) or even
   by-hand Internet query can be used for locating another mapping
   authority.

   There are tools to assist in generating ARKs and other identifiers,
   such as [NOID] and "uuidgen", both of which rely for uniqueness on
   human-maintained registries. This document also contains some
   guidelines and considerations for managing namespaces and choosing
   hostnames with persistence in mind.

1.2.  Three Requirements of ARKs

   The first requirement of an ARK is to give users a link from an
   object to a promise of stewardship for it.
   That promise is a multi-faceted
   covenant that binds the word of an identified service
   provider to a specific set of responsibilities. It is critical for
   the promise to come from a current provider and almost irrelevant,
   over a long period of time, what the original assigner's intentions
   were. No one can tell if successful stewardship will take place
   because no one can predict the future. Reasonable conjecture,
   however, may be based on past performance. There must be a way to
   tie a promise of persistence to a provider's demonstrated or
   perceived ability -- its reputation -- in that arena. Provider
   reputations would then rise and fall as promises are observed
   variously to be kept and broken. This is perhaps the best way we
   have for gauging the strength of any persistence promise.

   The second requirement of an ARK is to give users a link from an
   object to a description of it. The problem with a naked identifier
   is that without a description real identification is incomplete.
   Identifiers common today are relatively opaque, though some contain
   ad hoc clues reflecting assertions that were briefly true, such as
   where in a filesystem hierarchy an object lived during a short stay.
   Possession of both an identifier and an object is some improvement,
   but positive identification may still be uncertain since the object
   itself might not include a matching identifier or might not carry
   evidence obvious enough to reveal its identity without significant
   research. In either case, what is called for is a record bearing
   witness to the identifier's association with the object, as supported
   by a recorded set of object characteristics. This descriptive record
   is partly an identification "receipt" with which users and archivists
   can verify an object's identity after brief inspection and a
   plausible match with recorded characteristics such as title and size.

   The final requirement of an ARK is to give users a link to the object
   itself (or to a copy) if at all possible. Persistent identification
   plays a vital supporting role but, strictly speaking, it can be
   construed as no more than a record attesting to the original
   assignment of a never-reassigned identifier. Object access may not
   be feasible for various reasons, such as a transient service outage,
   a catastrophic loss, a licensing agreement that keeps an archive
   "dark" for a period of years, or when an object's own lack of
   tangible existence confuses normal concepts of access (e.g., a
   vocabulary term might be "accessed" through its definition). In such
   cases the ARK's identification role assumes a much higher profile.
   But attempts to simplify the persistence problem by decoupling access
   from identification and concentrating exclusively on the latter are
   of questionable utility. A perfect system for assigning forever
   unique identifiers might be created, but if it did so without
   reducing access failure rates, no one would be interested. The
   central issue -- which may be crudely summed up as the "HTTP 404 Not
   Found" problem -- would not have been addressed.

   The central duty of an ARK is a high-quality experience of access and
   identification. This means supporting reliable access during the
   period described in its stewardship promise and, failing that,
   supporting reliable access to a record describing the thing the ARK
   is associated with.

   ARK resolvers must support the `?info' inflection for requesting
   metadata. Older versions of this specification distinguished between
   two minimal inflections: `?' (brief metadata) and `??' (more
   metadata). While these older inflections are still reserved, because
   they have proven hard to recognize in some environments, supporting
   them is optional.

1.3.  Organizing Support for ARKs: Our Stuff vs. Their Stuff

   An organization and the user community it serves can often be seen to
   struggle with two different areas of persistent identification: the
   Our Stuff problem and the Their Stuff problem. In the Our Stuff
   problem, we in the organization want our own objects to acquire
   persistent names. Since we possess or control these objects, our
   organization tackles the Our Stuff problem directly. Whether or not
   the objects are named by ARKs, our organization is the responsible
   party, so it can plan for, maintain, and make commitments about the
   objects.

   In the Their Stuff problem, we in the organization want others'
   objects to acquire persistent names. These are objects that we do
   not own or control, but some of which are critically important to us.
   But because they are beyond our influence as far as support is
   concerned, creating and maintaining persistent identifiers for Their
   Stuff is not especially purposeful or feasible for us to engage in.
   There is little that we can do about someone else's stuff except
   encourage their uptake or adoption of persistence services.

   Co-location of persistent access and identification services is
   natural. Any organization that undertakes ongoing support of true
   persistent identification (which includes description) is well-served
   if it controls, owns, or otherwise has clear internal access to the
   identified objects, and this gives it an advantage if it wishes also
   to support persistent access to outsiders. Conversely, persistent
   access to outsiders requires orderly internal collection management
   procedures that include monitoring, acquisition, verification, and
   change control over objects, which in turn requires object
   identifiers persistent enough to support auditable record keeping
   practices.

   Although organizing ARK support under one roof thus tends to make
   sense, object hosting can successfully be separated from name
   mapping. An example is when a name mapping authority centrally
   provides uniform resolution services via a protocol gateway on behalf
   of organizations that host objects behind a variety of access
   protocols. It is also reasonable to build value-added description
   services that rely on the underlying services of a set of mapping
   authorities.

   Supporting ARKs is not for every organization. By requiring
   specific, revealed commitments to preservation, to object access, and
   to description, the bar for providing ARK services is higher than for
   some other identifier schemes. On the other hand, it would be hard
   to grant credence to a persistence promise from an organization that
   could not muster the minimum ARK services. Not that there isn't a
   business model for an ARK-like, description-only service built on top
   of another organization's full complement of ARK services. For
   example, there might be competition at the description level for
   abstracting and indexing a body of scientific literature archived in
   a combination of open and fee-based repositories. The
   description-only service would have no direct commitment to the
   objects, but would act as an intermediary, forwarding commitment
   statements from object hosting services to requestors.

1.4.  Definition of Identifier

   An identifier is not a string of character data -- an identifier is
   an association between a string of data and an object. This
   abstraction is necessary because without it a string is just data.
   It's nonsense to talk about a string's breaking, or about its being
   strong, maintained, and authentic. But as a representative of an
   association, a string can do, metaphorically, the things that we
   expect of it.

   Without regard to whether an object is physical, digital, or
   conceptual, to identify it is to claim an association between it and
   a representative string, such as "Jane" or "ISBN 0596000278". What
   gives a claim credibility is a set of verifiable assertions, or
   metadata, about the object, such as age, height, title, or number of
   pages. In other words, the association is made manifest by a record
   (e.g., a cataloging or other metadata record) that vouches for it.

   In the complete absence of any testimony (metadata) regarding an
   association, a would-be identifier string is a meaningless sequence
   of characters. To keep an externally visible but otherwise internal
   string from being perceived as an identifier by outsiders, for
   example, it suffices for an organization not to disclose the nature
   of its association. For our immediate purpose, actual existence of
   an association record is more important than its authenticity or
   verifiability, which are outside the scope of this specification.

   It is a gift to the identification process if an object carries its
   own name as an inseparable part of itself, such as an identifier
   imprinted on the first page of a document or embedded in a data
   structure element of a digital document header. In cases where the
   object is large, unwieldy, or unavailable (such as when licensing
   restrictions are in effect), a metadata record that includes the
   identifier string will usually suffice. That record becomes a
   conveniently manipulable object surrogate, acting as both an
   association "receipt" and "declaration".

   Note that our definition of identifier extends the one in use for
   Uniform Resource Identifiers [RFC3986]. The present document still
   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
   the string part of an identifier, but the context should make the
   meaning clear.

2.  ARK Anatomy

   An ARK is represented by a sequence of characters (a string) that
   contains the Label, "ark:", optionally preceded by the beginning part
   of a URL. Here is a diagrammed example.

                     ANATOMY OVERVIEW
                     ================

     Resolver Service            Compact ARK
    __________________  ______________________________
   /                  \/                              \
   https://example.org/ark:12345/x6np1wh8k/c3/s5.v7.xsl
   \___________________________/\________/\___________/
             Prefixes           Base Name   Suffixes
   \__________________________________________________/
                       Mapping ARK

   When embedded in a URL, an ARK consists of a Compact ARK preceded by
   a Resolver Service. The larger URL-based ARK is known as a Mapping
   ARK because it is ready to be mapped (resolved) to an information
   response (e.g., a PDF or metadata). A Mapping ARK is also known as a
   "fully qualified ARK". The Resolver Service, which need not be
   limited to URLs in the future, maps the URL according to rules and
   abilities of an NMA (Name Mapping Authority). The same URL string
   minus the Resolver Service component is known as a Compact ARK. The
   Compact ARK is globally unique and may be resolvable via different
   Resolver Services over time (e.g., when one archive succeeds another)
   or at the same time (e.g., when one archive backs up another).

   At a high level, after the Label comes the NAAN (Name Assigning
   Authority Number) followed by the Name that it assigns to the
   identified thing. The Base Name has Prefixes (NAAN, Label, possibly
   a Resolver Service) and optional Suffixes to identify Parts and
   Variant forms. During resolution, a Resolver Service such as n2t.net
   may be able to deal with inflections, query strings, and content
   negotiation.
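
   The split between Resolver Service and Compact ARK described above
   can be illustrated with a minimal sketch. It is not part of the
   specification: the function name split_mapping_ark is hypothetical,
   and the sketch assumes that the Resolver Service ends at the '/' of
   the first "/ark:" in the string and that a string starting with
   "ark:" is already compact.

```python
def split_mapping_ark(ark: str) -> tuple[str, str]:
    """Split an ARK string into (resolver_service, compact_ark).

    Hypothetical helper for illustration only. A string that already
    starts with the "ark:" label is treated as a Compact ARK with an
    empty Resolver Service.
    """
    if ark.startswith("ark:"):
        return "", ark
    # Assumption: the Resolver Service ends with the '/' of the
    # first occurrence of "/ark:".
    cut = ark.find("/ark:")
    if cut < 0:
        raise ValueError(f"no 'ark:' label found in {ark!r}")
    return ark[: cut + 1], ark[cut + 1 :]


resolver, compact = split_mapping_ark(
    "https://example.org/ark:12345/x6np1wh8k/c3/s5.v7.xsl"
)
print(resolver)  # https://example.org/
print(compact)   # ark:12345/x6np1wh8k/c3/s5.v7.xsl
```

   Searching for "/ark:" rather than "ark:" avoids a false match on a
   hostname that happens to begin with "ark".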

                     ANATOMY DETAILS
                     ===============

                        Base Compact Name   Qualifiers
                        _________________  ___________
                       /                 \/           \
   https://example.org/ark:12345/x6np1wh8k/c3/s5.v7.xsl
           \_________/ \__/\___/\_/\_____/\____/\_____/
               NMA    Label NAAN |  Blade Parts Variants
                             Shoulder
                           \_____________/
                             Check Zone

   In a closer view, the Compact ARK consists of a Base Compact Name
   followed potentially by Qualifiers. The Base Name often, but not
   necessarily, consists of a Shoulder (for subdividing a NAAN
   namespace) followed by a Blade. If a check character is present in
   an ARK, by convention it is the right-most character of the Base
   Name, and will have been computed over the string of characters
   preceding it back to the beginning of the NAAN. This string,
   including the check character itself, is the Check Zone.

   Like the ARK itself, the NAAN "12345" and Shoulder "x6" have compact
   and fully qualified forms.

   +==========+=======+==============+================================+
   | Form     | Base  | Compact Form | Fully Qualified Form           |
   +==========+=======+==============+================================+
   | NAAN     | 12345 | ark:12345    | https://example.org/ark:12345  |
   +----------+-------+--------------+--------------------------------+
   | Shoulder | x6    | ark:12345/x6 | https://example.org/ark:12345/ |
   |          |       |              | x6                             |
   +----------+-------+--------------+--------------------------------+

      Table 1: Examples of base, compact, and fully qualified forms

   The ARK syntax can be summarized,

      [https://NMA/]ark:[/]NAAN/Name[Qualifiers]

   where the NMA, '/', and Qualifier parts are in brackets to indicate
   that they are optional. The Base Compact Name is the substring
   comprising the "ark:" label, the NAAN and the assigned Name. The
   Resolver Service is replaceable and makes the ARK actionable for a
   period of time.
   Without the Resolver Service part, what remains is
   the Core Immutable Identity (the "persistible") part of the ARK.

2.1.  The Name Mapping Authority (NMA)

   Before the "ark:" label may appear an optional Name Mapping Authority
   (NMA) that is a temporary address where ARK service requests may be
   sent. Preceded by a URI-type protocol designation such as
   "https://", it specifies a Resolver Service. The NMA itself is an
   Internet hostname or host/port combination, optionally followed by
   URI-type path components, all ending in a '/'. The hostname has the
   same format and semantics as the host/port part of a URL. In any
   optional path that follows it, the path is considered to end with the
   '/' in the first occurrence of "/ark:".

   The most important thing about the NMA is that it is "identity inert"
   from the point of view of object identification. In other words,
   ARKs that differ only in the optional NMA part identify the same
   object. Thus, for example, the following three ARKs are synonyms for
   just one information object:

      http://example.org/rslvr/ark:12345/x54xz321
      https://example.com/ark:12345/x54xz321
      ark:12345/x54xz321

   Strictly speaking, in the realm of digital objects, these ARKs may
   lead over time to somewhat different or diverging instances of the
   originally named object. It can be argued that divergence of
   persistent objects is not desirable, but it is widely believed that
   digital preservation efforts will inevitably lead to alterations in
   some original objects (e.g., a format migration in order to preserve
   the ability to display a document). If any of those objects are held
   redundantly in more than one organization (a common preservation
   strategy), chances are small that all holding organizations will
   perform the same precise transformations and all maintain the same
   object metadata.
   More significant divergence would be expected when
   the holding organizations serve different audiences or compete with
   each other.

   The NMA part makes an ARK into an actionable URL. As with many
   Internet parameters, it is helpful to approach the NMA being liberal
   in what you accept and conservative in what you propose. From the
   recipient's point of view, the NMA part should be treated as
   temporary, disposable, and replaceable. From the NMA's point of
   view, it should be chosen with the greatest concern for longevity. A
   carefully chosen NMA should be at least as permanent as the providing
   organization's own hostname. In the case of a national or university
   library, for example, there is no reason why the NMA could not be
   considerably more permanent than soft-funded proxy hostnames such as
   hdl.handle.net, dx.doi.org, and purl.org. In general and over time,
   however, it is not unexpected for an NMA eventually to stop working
   and require replacement with the NMA of a currently active service
   provider.

   This replacement relies on a mapping authority "resolver" discovery
   process, of which two alternate methods are outlined in a later
   section. The ARK, URN, Handle, and DOI schemes all use a resolver
   discovery model that sooner or later requires matching the original
   assigning authority with a current provider servicing that
   authority's named objects; once found, the resolver at that provider
   performs what amounts to a redirect to a place where the object is
   currently held. All the schemes rely on the ongoing functionality of
   currently mainstream technologies such as the Domain Name System
   [RFC1034] and web browsers. The Handle and DOI schemes in addition
   require that the Handle protocol layer and global server grid be
   available at all times.
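
   From the recipient's side, treating the NMA as disposable and
   replaceable amounts to stripping everything before the label and
   prepending a currently working Resolver Service. The following is a
   minimal sketch under stated assumptions, not a normative procedure:
   the helper names compact_form and with_resolver are hypothetical,
   the choice of "https://n2t.net/" as the replacement is illustrative
   only, and resolver discovery itself is not shown.

```python
def compact_form(ark: str) -> str:
    """Drop the optional NMA part, keeping the Compact ARK (hypothetical helper)."""
    if ark.startswith("ark:"):
        return ark
    # The NMA path ends with the '/' in the first occurrence of "/ark:".
    cut = ark.find("/ark:")
    if cut < 0:
        raise ValueError(f"no 'ark:' label found in {ark!r}")
    return ark[cut + 1 :]


def with_resolver(ark: str, resolver: str) -> str:
    """Re-attach the same Compact ARK to a different Resolver Service."""
    if not resolver.endswith("/"):
        resolver += "/"
    return resolver + compact_form(ark)


synonyms = [
    "http://example.org/rslvr/ark:12345/x54xz321",
    "https://example.com/ark:12345/x54xz321",
    "ark:12345/x54xz321",
]

# The NMA is "identity inert": all three reduce to one Compact ARK.
assert len({compact_form(a) for a in synonyms}) == 1

print(with_resolver(synonyms[0], "https://n2t.net/"))
# https://n2t.net/ark:12345/x54xz321
```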
534 The practice of prepending "https://" and an NMA to an ARK is a way 535 of creating an actionable identifier by a method that is itself 536 temporary. Assuming that infrastructure supporting [RFC2616] 537 information retrieval will no longer be available one day, ARKs will 538 then have to be converted into new kinds of actionable identifiers. 539 By that time, if ARKs see widespread use, web browsers would 540 presumably evolve to perform this (currently simple) transformation 541 automatically. 543 2.2. The ARK Label Part (ark:) 545 The label part distinguishes an ARK from an ordinary identifier. 546 There is a new form of the label, "ark:", and an old form, "ark:/", 547 both of which must be recognized in perpetuity. Implementations 548 should generate new ARKs in the new form (without the "/") and 549 resolvers must always treat received ARKs as equivalent if they 550 differ only in regard to new form versus old form labels. Thus these 551 two ARKs are equivalent: 553 ark:/12345/x54xz321 554 ark:12345/x54xz321 556 In a URL found in the wild, the label indicates that the URL stands a 557 reasonable chance of being an ARK. If the context warrants, 558 verification that it actually is an ARK can be done by testing it for 559 existence of the three ARK services. 561 Since nothing about an identifier syntax directly affects 562 persistence, the "ark:" label (like "urn:", "doi:", and "hdl:") 563 cannot tell you whether the identifier is persistent or whether the 564 object is available. It does tell you that the original Name 565 Assigning Authority (NAA) had some sort of hopes for it, but it 566 doesn't tell you whether that NAA is still in existence, or whether a 567 decade ago it ceased to have any responsibility for providing 568 persistence, or whether it ever had any responsibility beyond naming. 
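The equivalence of the old and new label forms can be sketched in a few lines of Python. This is only an illustration: it assumes the input is a bare ARK whose NMA has already been removed, and the helper names (canonical_label, labels_equivalent) are hypothetical. It does not perform the other normalization steps that a full lexical comparison requires.

```python
import re

# Matches the new form "ark:" or the old form "ark:/" at the start of a
# bare ARK, in any letter case.
_LABEL = re.compile(r"^ark:/?", re.IGNORECASE)


def canonical_label(ark: str) -> str:
    """Rewrite a leading "ark:" or "ark:/" label to the new form "ark:"."""
    return _LABEL.sub("ark:", ark, count=1)


def labels_equivalent(a: str, b: str) -> bool:
    """True if two bare ARKs differ at most in old versus new label form."""
    return canonical_label(a) == canonical_label(b)


if __name__ == "__main__":
    print(labels_equivalent("ark:/12345/x54xz321", "ark:12345/x54xz321"))
```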
570 Only a current provider can say for certain what sort of commitment 571 it intends, and the ARK label suggests that you can query the NMA 572 directly to find out exactly what kind of persistence is promised. 573 Even if what is promised is impersistence (i.e., a short-term 574 identifier), saying so is valuable information to the recipient. 575 Thus an ARK is a high-functioning identifier in the sense that it 576 provides access to the object, the metadata, and a commitment 577 statement, even if the commitment is explicitly very weak. 579 2.3. The Name Assigning Authority Number (NAAN) 581 Recalling that the general form of the ARK is, 583 [https://NMA/]ark:[/]NAAN/Name[Qualifiers] 585 the part of the ARK directly following the "ark:" (or older "ark:/") 586 label is the Name Assigning Authority Number (NAAN), up to but not 587 including the next `/' (slash) character. This part is always 588 required, as it identifies the organization that originally assigned 589 the Name of the object. Typically the organization is an 590 institution, a department, a laboratory, or any group that conducts a 591 stable, policy-driven name assigning effort. An organization may 592 request a NAAN from the ARK Maintenance Agency [ARKagency] (described 593 in Appendix A) by filling out the form [NAANrequest]. 595 For received ARKs, implementations must support a minimum NAAN length 596 of 16 octets. NAANs are opaque strings of one or more "betanumeric" 597 characters, specifically, 599 0123456789bcdfghjkmnpqrstvwxz 601 which consists of digits and consonants, minus the letter 'l'. 602 Restricting NAANs to betanumerics (alphanumerics without vowels or 603 'l') serves two goals. It reduces the chances that words -- past, 604 present, and future -- will appear in NAANs and carry unintended 605 semantics. 
It also helps usability by not mixing commonly confused 606 characters ('0' and 'O', '1' and 'l') and by being compatible with 607 strong transcription error detection (eg, the [NOID] check digit 608 algorithm). Since 2001, every assigned NAAN has consisted of exactly 609 five digits. 611 The NAAN designates a top-level ARK namespace. Once registered for a 612 namespace, a NAAN is never re-registered. It is possible, however, 613 for there to be a succession of organizations that manage an ARK 614 namespace. 616 There are currently four NAANs available to all organizations. An 617 ARK bearing one of these NAANs carries a specific, immutable meaning 618 that recipients can rely on for long term pragmatic benefit as 619 described below. 621 +==========+================================+==========+============+ 622 | Shared | The immutable purpose, | Expect | OK for | 623 | NAAN | meaning, or connotation of | to | long term | 624 | meaning | ARKs bearing this NAAN. | resolve? | reference? | 625 +==========+================================+==========+============+ 626 | 12345 | Example ARKs appearing in | maybe | no | 627 | examples | documentation. They might | | | 628 | | resolve, but no link checker | | | 629 | | need be concerned if they | | | 630 | | don't. They should not be | | | 631 | | considered viable for long | | | 632 | | term reference. | | | 633 +----------+--------------------------------+----------+------------+ 634 | 99152 | ARKs for controlled | yes | yes | 635 | terms | vocabulary and ontology | | | 636 | | terms, such as metadata | | | 637 | | element names and pick-list | | | 638 | | values. They should resolve | | | 639 | | to term definitions and are | | | 640 | | suitable for long term | | | 641 | | reference. 
| | | 642 +----------+--------------------------------+----------+------------+ 643 | 99166 | ARKs for people, groups, and | yes | yes | 644 | agents | institutions as "agents" | | | 645 | | (actors, such as creators, | | | 646 | | contributors, publishers, | | | 647 | | performers, etc). They | | | 648 | | should resolve to agent | | | 649 | | definitions and are suitable | | | 650 | | for long term reference. | | | 651 +----------+--------------------------------+----------+------------+ 652 | 99999 | ARKs for test, development, | maybe | no | 653 | test ids | or experimental purposes, | | | 654 | | often at scale. They might | | | 655 | | resolve, but no link checker | | | 656 | | need be concerned if they | | | 657 | | don't. They should not be | | | 658 | | considered viable for long | | | 659 | | term reference. | | | 660 +----------+--------------------------------+----------+------------+ 662 Table 2: Four NAANs shared across all organizations 664 To make use of a shared NAAN, an organization has several options 665 described in Section 2.4.1. 667 2.4. The Name Part 669 The part of the ARK just after the NAAN is the Name assigned by the 670 NAA, and it is also required. Semantic opaqueness in the Name part 671 is strongly encouraged in order to reduce an ARK's vulnerability to 672 era- and language-specific change. Identifier strings containing 673 linguistic fragments can create support difficulties down the road. 674 No matter how appropriate or even meaningless they are today, such 675 fragments may one day create confusion, give offense, or infringe on 676 a trademark as the semantic environment around us and our communities 677 evolves. 679 Names that look more or less like numbers avoid common problems that 680 defeat persistence and international acceptance. The use of digits 681 is highly recommended. 
Mixing in non-vowel alphabetic characters 682 (eg, betanumerics) a couple at a time is a relatively safe and easy 683 way to achieve a denser namespace (more possible names for a given 684 length of the name string). Such names have a chance of aging and 685 traveling well. The absence of recognizable words makes typos harder 686 to detect in opaque strings, so a common mitigation is to add a check 687 character. Tools exist that mint, bind, and resolve opaque 688 identifiers, with or without check characters [NOID]. More on naming 689 considerations is given in a subsequent section. 691 2.4.1. Optional: Shoulders 693 Just as an ARK namespace is subdivided by NAANs reserved for NAAs, it 694 is generally advantageous for an NAA to subdivide its own NAAN 695 namespace into "shoulders", where each shoulder is reserved for an 696 internal department or unit. Like the NAAN, which is a string of 697 characters that follows the "ark:" label, a shoulder is a string of 698 characters (starting with a "/") that extends the NAAN. The base 699 compact name assigned by the NAA consists of the NAAN, the shoulder, 700 and a final string known as the "blade". (The shoulder plus blade 701 terminology mirrors locksmith jargon describing the information- 702 bearing parts of a key.) 704 The blade string is chosen by the NAA such that the string created by 705 concatenating the NAAN plus shoulder plus blade becomes the unique 706 base object name. Otherwise the blade may come from any source; for 707 example, it might come from a counter, a timestamp, a [NOID] minter, 708 a legacy 100-year-old accession number, etc. If there is a check 709 digit, it is expected to appear at the end of the blade and to be 710 computed over the base compact name, which is generally the most 711 important part of an ARK to make opaque.
In particular, check digits 712 are not expected to cover qualifiers, which often name subobjects of 713 a persistent object that are less stable and less opaquely named than 714 the parent object (for example, ten years hence, the object's 715 thumbnail image will be of a higher resolution and the OCR text file 716 will be re-derived with improved algorithms). 718 It is important not to use any delimiter between the shoulder string 719 and blade string, especially not a "/" since it declares an object 720 boundary (see the section on ARKs that reveal object hierarchy). 722 ark:12345/x5wf6789/c2/s4.pdf # correct primordinal shoulder 723 ark:12345/x5/wf6789/c2/s4.pdf # INCORRECT 724 ^ WRONG 726 This little bit of discretion shields organizations from end users 727 making inferences about expected levels of support based on 728 recognizable shoulders. To help in-house ARK administrators reliably 729 know where the shoulder ends, it is recommended to use the "first- 730 digit convention" so that shoulders are "primordinal". A primordinal 731 shoulder is a sequence of one or more betanumeric characters ending 732 in a digit, as shown above. This means that the shoulder is all 733 consonant letters (often just one) after the NAAN and "/" up to and 734 including the first digit encountered after the NAAN. One property 735 of primordinal shoulders is that there is an infinite number of them 736 possible under any NAAN. 738 To help manage each namespace into the future, NAAs are encouraged to 739 create shoulders, even if there is only one to start with. If an 740 organization wishes to create a shoulder under one of the shared NAANs 741 (99999, 12345, 99152, or 99166, described in Table 2), it should fill 742 out the Shoulder Request Form [shoulderrequest]. 744 2.5. The Qualifier Part 746 The part of the ARK following the NAA-assigned Name is an optional 747 Qualifier.
It is a string that extends the Base Name in order to 748 create a kind of service entry point into the object named by the 749 NAA. At the discretion of the providing NMA, such a service entry 750 point permits an ARK to support access to individual hierarchical 751 components and subcomponents of an object, and to variants (versions, 752 languages, formats) of components. A Qualifier may be invented by 753 the NAA or by any NMA servicing the object. 755 In form, the Qualifier is a ComponentPath, or a VariantPath, or a 756 ComponentPath followed by a VariantPath. A VariantPath is introduced 757 and subdivided by the reserved character `.', and a ComponentPath is 758 introduced and subdivided by the reserved character `/'. In this 759 example, 761 https://example.org/ark:12345/x54xz321/s3/f8.v05.tiff 763 the string "/s3/f8" is a ComponentPath and the string ".v05.tiff" is 764 a VariantPath. The ARK Qualifier is a formalization of some 765 currently mainstream URL syntax conventions. This formalization 766 specifically reserves meanings that permit recipients to make strong 767 inferences about logical sub-object containment and equivalence based 768 only on the form of the received identifiers; there is great 769 efficiency in not having to inspect metadata records to discover such 770 relationships. NMAs are free not to disclose any of these 771 relationships merely by avoiding the reserved characters above. 772 Hierarchical components and variants are discussed further in the 773 next two sections. 775 The Qualifier, if present, differs from the Name in several important 776 respects. First, a Qualifier may have been assigned either by the 777 NAA or later by the NMA. The assignment of a Qualifier by an NMA 778 effectively amounts to an act of publishing a service entry point 779 within the conceptual object originally named by the NAA. For our 780 purposes, an ARK extended with a Qualifier assigned by an NMA will be 781 called an NMA-qualified ARK. 
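To make the ComponentPath and VariantPath distinction concrete, here is a minimal Python sketch that splits an ARK into its NAAN, Name, ComponentPath, and VariantPath. It rests on several assumptions that are not requirements of this specification: the Name is taken to be the first run of characters after the NAAN that contains neither `/' nor `.', the VariantPath is taken to begin at the first `.' of the Qualifier, and no normalization or validation is attempted. The names ParsedArk and parse_ark are hypothetical.

```python
import re
from typing import NamedTuple


class ParsedArk(NamedTuple):
    naan: str
    name: str
    component_path: str  # e.g. "/s3/f8", or "" if absent
    variant_path: str    # e.g. ".v05.tiff", or "" if absent


def parse_ark(ark: str) -> ParsedArk:
    """Split an ARK into NAAN, Name, ComponentPath, and VariantPath."""
    # Drop any NMA: everything up to the '/' in the first "/ark:".
    idx = ark.lower().find("/ark:")
    body = ark[idx + 1:] if idx >= 0 else ark
    if not body.lower().startswith("ark:"):
        raise ValueError("no ark: label found in %r" % ark)
    body = body[4:]
    if body.startswith("/"):
        body = body[1:]  # accept the old "ark:/" label form
    naan, sep, rest = body.partition("/")
    name_match = re.match(r"[^/.]+", rest)
    if not sep or not naan or name_match is None:
        raise ValueError("missing NAAN or Name in %r" % ark)
    name = name_match.group(0)
    qualifier = rest[name_match.end():]
    # Assumption: the VariantPath starts at the first '.' in the Qualifier.
    dot = qualifier.find(".")
    if dot < 0:
        return ParsedArk(naan, name, qualifier, "")
    return ParsedArk(naan, name, qualifier[:dot], qualifier[dot:])


if __name__ == "__main__":
    print(parse_ark("https://example.org/ark:12345/x54xz321/s3/f8.v05.tiff"))
```

Applied to the example above, the sketch yields the ComponentPath "/s3/f8" and the VariantPath ".v05.tiff", so relationships can be read off the identifier without consulting metadata.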
783 Second, a Qualifier assignment on the part of an NMA is made in 784 fulfillment of its service obligations and may reflect changing 785 service expectations and technology requirements. NMA-qualified ARKs 786 could therefore be transient, even if the base, unqualified ARK is 787 persistent. For example, it would be reasonable for an NMA to 788 support access to an image object through an actionable ARK that is 789 considered persistent even if the experience of that access changes 790 as linking, labeling, and presentation conventions evolve and as 791 format and security standards are updated. For an image "thumbnail", 792 that NMA could also support an NMA-qualified ARK that is considered 793 impersistent because the thumbnail will be replaced with higher 794 resolution images as network bandwidth and CPU speeds increase. At 795 the same time, for an originally scanned, high-resolution master, the 796 NMA could publish an NMA-qualified ARK that is itself considered 797 persistent. Of course, the NMA must be able to return its separate 798 commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs, 799 and to any NAA-qualified ARKs that it supports. 801 A third difference between a Qualifier and a Name concerns the 802 semantic opaqueness constraint. When an NMA-qualified ARK is to be 803 used as a transient service entry point into a persistent object, the 804 priority given to semantic opaqueness observed by the NAA in the Name 805 part may be relaxed by the NMA in the Qualifier part. If service 806 priorities in the Qualifier take precedence over persistence, short- 807 term usability considerations may recommend somewhat semantically 808 laden Qualifier strings. 810 Finally, not only is the set of Qualifiers supported by an NMA 811 mutable, but different NMAs may support different Qualifier sets for 812 the same NAA-identified object. In this regard the NMAs act 813 independently of each other and of the NAA.
815 The next two sections describe how ARK syntax may be used to declare, 816 or to avoid declaring, certain kinds of relatedness among qualified 817 ARKs. 819 2.5.1. ARKs that Reveal Object Hierarchy 821 An NAA or NMA may choose to reveal the presence of a hierarchical 822 relationship between objects using the `/' (slash) character after 823 the Name part of an ARK. Some authorities will choose not to 824 disclose this information, while others will go ahead and disclose so 825 that manipulators of large sets of ARKs can infer object 826 relationships by simple identifier inspection; for example, this 827 makes it possible for a system to present a collapsed view of a large 828 search result set. 830 If the ARK contains an internal slash after the NAAN, the piece to 831 its left indicates a containing object. For example, publishing an 832 ARK of the form, 834 ark:12345/x54/xz/321 836 is equivalent to publishing three ARKs, 837 ark:12345/x54/xz/321 838 ark:12345/x54/xz 839 ark:12345/x54 841 together with a declaration that the first object is contained in the 842 second object, and that the second object is contained in the third. 844 Revealing the presence of hierarchy is completely up to the assigner 845 (NMA or NAA). It is hard enough to commit to one object's name, let 846 alone to three objects' names and to a specific, ongoing relatedness 847 among them. Thus, regardless of whether hierarchy was present 848 initially, the assigner, by not using slashes, reveals no shared 849 inferences about hierarchical or other inter-relatedness in the 850 following ARKs: 852 ark:12345/x54_xz_321 853 ark:12345/x54_xz 854 ark:12345/x54xz321 855 ark:12345/x54xz 856 ark:12345/x54 858 Note that slashes around the ARK's NAAN (/12345/ in these examples) 859 are not part of the ARK's Name and therefore do not indicate the 860 existence of some sort of NAAN super object containing all objects in 861 its namespace. 
A slash must have at least one non-structural 862 character (one that is neither a slash nor a period) on both sides in 863 order for it to separate recognizable structural components. So 864 initial or final slashes may be removed, and double slashes may be 865 converted into single slashes. 867 2.5.2. ARKs that Reveal Object Variants 869 An NAA or NMA may choose to reveal the possible presence of variant 870 objects or object components using the `.' (period) character after 871 the Name part of an ARK. Some authorities will choose not to 872 disclose this information, while others will go ahead and disclose so 873 that manipulators of large sets of ARKs can infer object 874 relationships by simple identifier inspection. This makes it 875 possible for a system to present a collapsed view of a large number 876 of search result items without having to issue database queries in 877 order to retrieve and analyze the inter-relatedness among all of 878 those items. 880 If the ARK contains an internal period after the Name, the piece to 881 the left of the first such period is a root name and the piece to its 882 right, and up to the end of the ARK or to the next period is a 883 suffix. A Name may have more than one suffix, for example, 884 ark:12345/x54.24 885 ark:12345/x4z/x54.24 886 ark:12345/x54.v18.fr.odf 888 There are two main rules. First, if two ARKs share the same root 889 name but have different suffixes, the corresponding objects were 890 considered variants of each other (different formats, languages, 891 versions, etc.) by the assigner (NMA or NAA). Thus, the following 892 ARKs are variants of each other: 894 ark:12345/x54.v18.fr.odf 895 ark:12345/x54.321xz 896 ark:12345/x54.44 898 Second, publishing an ARK with a suffix implies the existence of at 899 least one variant identified by the ARK without its suffix. The ARK 900 is otherwise silent about what additional variants might exist. 
So 901 publishing the ARK, 903 ark:12345/x54.v18.fr.odf 905 is equivalent to publishing the four ARKs, 907 ark:12345/x54.v18.fr.odf 908 ark:12345/x54.v18.fr 909 ark:12345/x54.v18 910 ark:12345/x54 912 Revealing the possibility of variants is completely up to the 913 assigner. It is hard enough to commit to one object's name, let 914 alone to multiple variants' names and to a specific, ongoing 915 relatedness among them. The assigner is the sole arbiter of what 916 constitutes a variant within its namespace, and whether to reveal 917 that kind of relatedness by using periods within its names. 919 A period must have at least one non-structural character (one that is 920 neither a slash nor a period) on both sides in order for it to 921 separate recognizable structural components. So initial or final 922 periods may be removed, and adjacent periods may be converted into a 923 single period. 925 2.6. Character Repertoires 927 The Name and Qualifier parts are strings of visible ASCII characters. 928 For received ARKs, implementations must support a minimum length of 929 255 octets for the string composed of the Base Name plus Qualifier. 930 Implementations generating strings exceeding this length should 931 understand that receiving implementations may not be able to index 932 such ARKs properly. Characters may be letters, digits, or any of 933 these seven characters: 935 = ~ * + @ _ $ 937 The following characters may also be used, but their meanings are 938 reserved: 940 % - . / 942 The characters `/' and `.' are ignored if either appears as the last 943 character of an ARK. If used internally, they allow a name assigner 944 to reveal object hierarchy and object variants as previously 945 described. 947 Hyphens are considered to be insignificant and are always ignored in 948 ARKs. A `-' (hyphen) may appear in an ARK for readability, or it may 949 have crept in during the formatting and wrapping of text, but it must 950 be ignored in lexical comparisons. 
As in a telephone number, hyphens 951 have no meaning in an ARK. It is always safe for an NMA that 952 receives an ARK to remove any hyphens found in it. As a result, like 953 the NMA, hyphens are "identity inert" in comparing ARKs for 954 equivalence. For example, the following ARKs are equivalent for 955 purposes of comparison and ARK service access: 957 ark:12345/x5-4-xz-321 958 https://sneezy.dopey.com/ark:12345/x54--xz32-1 959 ark:12345/x54xz321 961 The `%' character is reserved for %-encoding all other octets that 962 would appear in the ARK string, in the same manner as for URIs 963 [RFC3986]. A %-encoded octet consists of a `%' followed by two 964 uppercase hex digits; for example, "%7D" stands in for `}'. 965 Uppercase hex digits are preferred for compatibility with URI 966 encoding conventions, especially useful when URL-based ARKs are 967 compared for equivalence by ARK-unaware software systems; thus use 968 "%ACT" instead of "%acT". The character `%' itself must be 969 represented using "%25". As with URNs, %-encoding permits ARKs to 970 support legacy namespaces (e.g., ISBN, ISSN, SICI) that have less 971 restricted character repertoires [RFC2288]. 973 Implementors should be prepared to normalize some common invalid 974 characters that may be found in ARKs copy pasted from processed text. 975 For example, when pasting an ARK that was broken during line 976 wrapping, a user may inadvertently propagate newlines, spaces, 977 hyphens, and hyphen-like characters (eg, U+2010 to U+2015) that were 978 introduced by the publisher. The normalization strategy is up to the 979 implementor and may include converting hyphen-like characters to 980 hyphens and removing whitespace. 982 2.7. Normalization and Lexical Equivalence 984 To determine if two or more ARKs identify the same object, the ARKs 985 are compared for lexical equivalence after first being normalized. 
986 Since ARK strings may appear in various forms (e.g., having different 987 NMAs), normalizing them minimizes the chances that comparing two ARK 988 strings for equality will fail unless they actually identify 989 different objects. In a specified-host ARK (one having an NMA), the 990 NMA never participates in such comparisons. The normalization described 991 here serves to define lexical equivalence but does not restrict how 992 implementors normalize ARKs locally for storage. 994 Normalization of a received ARK for the purpose of octet-by-octet 995 equality comparison with another ARK consists of the following steps. 997 1. The NMA part (eg, everything from an initial "https://" up to 998 the first occurrence of "/ark:"), if present, is removed. 1000 2. Any URI query string is removed (everything from the first 1001 literal '?' to the end of the string). 1003 3. The first case-insensitive match on "ark:/" or "ark:" is 1004 converted to "ark:" (replacing any uppercase letters and 1005 removing any terminal '/'). 1007 4. Any uppercase letters in the NAAN are converted to lowercase. 1009 5. In the string that remains, the two characters following every 1010 occurrence of `%' are converted to uppercase. The case of all 1011 other letters in the ARK string must be preserved. 1013 6. All hyphens are removed. Implementors should be aware that non- 1014 ASCII hyphen-like characters (eg, U+2010 to U+2015) may arrive 1015 in the place of hyphens and, if they wish, remove them. 1017 7. If normalization is being done as part of a resolution step, and 1018 if the end of the remaining string matches a known inflection, 1019 the inflection is noted and removed. 1021 8. Structural characters (slash and period) are normalized: initial 1022 and final occurrences are removed, and two structural characters 1023 in a row (e.g., // or ./) are replaced by the first character, 1024 iterating until each occurrence has at least one non-structural 1025 character on either side. 1027 9.
If there are any components with a period on the left and a 1028 slash on the right, either the component and the preceding 1029 period must be moved to the end of the Name part or the ARK must 1030 be thrown out as malformed. 1032 The resulting ARK string is now normalized. Comparisons between 1033 normalized ARKs are case-sensitive, meaning that uppercase letters 1034 are considered different from their lowercase counterparts. 1036 To keep ARK string variation to a minimum, no reserved ARK characters 1037 should be %-encoded unless the intent is deliberately to conceal their 1038 reserved meanings. No non-reserved ARK characters should ever be 1039 %-encoded. Finally, no %-encoded character should ever appear in an 1040 ARK in its decoded form. 1042 2.8. Resolver Chains 1044 Resolution is a computation, often multi-stage, that maps a client 1045 identifier to a response. The response may be any "thing", such as a 1046 spreadsheet, a landing page, a metadata record, or a 404 Not Found. 1047 A single-stage retrieval of a web page is a resolution. More 1048 interesting kinds of resolution involve forwarding (indirection) and/ 1049 or proxying. 1051 On the web, forwarding is done with HTTP redirects. In general, ARK 1052 resolution on the web involves a chain of one or more redirects that 1053 ends with the web server, known as the Responder, that responds 1054 without redirecting. The Responder might be a proxy and itself 1055 initiate a sub-resolution request chain unbeknownst to the original 1056 client, but that is out of scope here. An ARK might have a Resource 1057 Responder that is a different host from the Metadata Responder. The 1058 client starts resolution by contacting the NMA (server host) found in 1059 the original Mapping ARK URL. This is known as the First Resolver. 1061 3.
Naming Considerations 1063 The most important threats faced by persistence providers include 1064 such things as funding loss, natural disaster, political and social 1065 upheaval, processing faults, and errors in human oversight. There is 1066 nothing that an identifier scheme can do about such things. Still, a 1067 few observed identifier failures and inconveniences can be traced 1068 back to naming practices that we now know to be less than optimal for 1069 persistence. 1071 3.1. ARKs and Usability 1073 Because linguistic constructs imperil persistence, for ARKs non-ASCII 1074 character support is unimportant. ARKs and URIs share goals of 1075 transcribability and transportability within web documents, so 1076 characters are required to be visible, non-conflicting with HTML/XML 1077 syntax, and not subject to tampering during transmission across 1078 common transport gateways. 1080 Any measure that reduces user irritation with an identifier will 1081 increase its chances of survival. This explains the rule preventing 1082 hyphens from having lexical significance. It is fine to publish ARKs 1083 with hyphens in them (e.g., the output of UUID/GUID 1084 generators), but the uniform treatment of hyphens (and their Unicode 1085 equivalents) as insignificant reduces the possibility of users 1086 transcribing identifiers that will have been broken through 1087 unpredictable hyphenation by word processors. 1089 3.2. Objects Should Wear Their Identifiers 1091 A valuable technique for provision of persistent objects is to try to 1092 arrange for the complete identifier to appear on, with, or near its 1093 retrieved object. An object encountered at a moment in time when its 1094 discovery context has long since disappeared could then easily be 1095 traced back to its metadata, to alternate versions, to updates, etc. 1096 This has seen reasonable success, for example, in book publishing and 1097 software distribution.
An identifier string only has meaning when 1098 its association is known, and this is a very sure, simple, and low-tech 1099 method of reminding everyone exactly what that association is. 1101 3.3. Names are Political, not Technological 1103 If persistence is the goal, a deliberate local strategy for 1104 systematic name assignment is crucial. Names must be chosen with 1105 great care. Poorly chosen and managed names will devastate any 1106 persistence strategy, and they do not discriminate by identifier 1107 scheme. Whether a mistakenly re-assigned name is a URN, DOI, PURL, 1108 URL, or ARK, the damage -- failed access and confusion -- is not 1109 mitigated more in one scheme than in another. Conversely, in-house 1110 efforts to manage names responsibly will go much further towards 1111 safeguarding persistence than any choice of naming scheme or name 1112 resolution technology. 1114 Branding (e.g., at the corporate or departmental level) is important 1115 for funding and visibility, but substrings representing brands and 1116 organizational names should be given a wide berth except when 1117 absolutely necessary in the hostname (the identity-inert) part of the 1118 ARK. These substrings are not only unstable because organizations 1119 change frequently, but they are also dangerous because successor 1120 organizations often have political or legal reasons to actively 1121 suppress predecessor names and brands. Any measure that reduces the 1122 chances of future political or legal pressure on an identifier will 1123 decrease the chances that our descendants will be obliged to 1124 deliberately break it. 1126 3.4. Choosing a Hostname or NMA 1128 Hostnames appearing in any identifier meant to be persistent must be 1129 chosen with extra care.
The tendency in hostname selection has 1130 traditionally been to choose a token with recognizable attributes, 1131 such as a corporate brand, but that tendency wreaks havoc with 1132 persistence that is supposed to outlive brands, corporations, subject 1133 classifications, and natural language semantics (e.g., what did the 1134 three letters "gay" mean in 1958, 1978, and 1998?). Today's 1135 recognized and correct attributes are tomorrow's stale or incorrect 1136 attributes. In making hostnames (any names, actually) long-term 1137 persistent, it helps to eliminate recognizable attributes to the 1138 extent possible. This affects selection of any name based on URLs, 1139 including PURLs and the explicitly disposable NMAs. 1141 There is no excuse for a provider that manages its internal names 1142 impeccably not to exercise the same care in choosing what could be an 1143 exceptionally durable hostname, especially if it would form the 1144 prefix for all the provider's URL-based external names. Registering 1145 an opaque hostname in the ".org" or ".net" domain would not be a bad 1146 start. Another way is to publish your ARKs with an organizational 1147 domain name that will be mapped by DNS to an appropriate NMA host. 1148 This makes for shorter names with less branding vulnerability. 1150 It is a mistake to think that hostnames are inherently unstable. If 1151 you require brand visibility, that may be a fact of life. But things 1152 are easier if yours is the brand of a long-lived cultural memory 1153 institution such as a national or university library or archive. 1154 Well-chosen hostnames from organizations that are sheltered from the 1155 direct effects of a volatile marketplace can easily provide longer- 1156 lived global resolvers than the domain names explicitly or implicitly 1157 used as starting points for global resolution by indirection-based 1158 persistent identifier schemes.
For example, it is hard to imagine 1159 circumstances under which the Library of Congress' domain name would 1160 disappear sooner than, say, "handle.net". 1162 For smaller libraries, archives, and preservation organizations, 1163 there is a natural concern about whether they will be able to keep 1164 their web servers and domain names in the face of uncertain funding. 1165 One option is to form or join a group of like-minded organizations 1166 with the purpose of providing mutual preservation support. The first 1167 goal of such a group would be to perpetually rent a hostname on which 1168 to establish a web server that simply redirects incoming member 1169 organization requests to the appropriate member server; using ARKs, 1170 for example, a 150-member group could run a very small server (24x7) 1171 that contained nothing more than 150 rewrite rules in its 1172 configuration file. Even more helpful would be additional consortial 1173 support for a member organization that was unable to continue 1174 providing services and needed to find a successor archival 1175 organization. This would be a low-cost, low-tech way to publish ARKs 1176 (or URLs) under highly persistent hostnames. 1178 There are no obvious reasons why the organizations registering DNS 1179 names, URN Namespaces, and DOI publisher IDs should have among them 1180 one that is intrinsically more fallible than the next. Moreover, it 1181 is a misconception that the demise of DNS and of HTTP need adversely 1182 affect the persistence of URLs. At such a time, certainly URLs from 1183 the present day might not then be actionable by our present-day 1184 mechanisms, but resolution systems for future non-actionable URLs are 1185 no harder to imagine than resolution systems for present-day non- 1186 actionable URNs and DOIs. There is no more stable a namespace than 1187 one that is dead and frozen, and that would then characterize the 1188 space of names bearing the "http://" or "https://" prefix. 
It is 1189 useful to remember that just because hostnames have been carelessly 1190 chosen in their brief history does not mean that they are unsuitable 1191 in NMAs (and URLs) intended for use in situations demanding the 1192 highest level of persistence available in the Internet environment. 1193 A well-planned name assignment strategy is everything. 1195 3.5. Assigners of ARKs 1197 A Name Assigning Authority (NAA) is an organization that creates (or 1198 delegates creation of) long-term associations between identifiers and 1199 information objects. Examples of NAAs include national libraries, 1200 national archives, and publishers. An NAA may arrange with an 1201 external organization for identifier assignment. The US Library of 1202 Congress, for example, allows OCLC (the Online Computer Library 1203 Center, a major world cataloger of books) to create associations 1204 between Library of Congress call numbers (LCCNs) and the books that 1205 OCLC processes. A cataloging record is generated that testifies to 1206 each association, and the identifier is included by the publisher, 1207 for example, in the front matter of a book. 1209 An NAA does not so much create an identifier as create an 1210 association. The NAA first draws an unused identifier string from 1211 its namespace, which is the set of all identifiers under its control. 1212 It then records the assignment of the identifier to an information 1213 object having sundry witnessed characteristics, such as a particular 1214 author and modification date. A namespace is usually reserved for an 1215 NAA by agreement with recognized community organizations (such as 1216 IANA and ISO) that all names containing a particular string be under 1217 its control. In the ARK an NAA is represented by the Name Assigning 1218 Authority Number (NAAN). 1220 The ARK namespace reserved for an NAA is the set of names bearing its 1221 particular NAAN. 
For example, all strings beginning with 1222 "ark:12345/" are under control of the NAA registered under 12345, 1223 which might be the National Library of Finland. Because each NAA has 1224 a different NAAN, names from one namespace cannot conflict with those 1225 from another. Each NAA is free to assign names from its namespace 1226 (or delegate assignment) according to its own policies. These 1227 policies must be documented in a manner similar to the declarations 1228 required for URN Namespace registration [RFC2611]. 1230 Organizations can request or update a NAAN by filling out the NAAN 1231 Request Form [NAANrequest]. 1233 3.6. NAAN Namespace Management 1235 Every NAA should have a namespace management strategy. A classic 1236 hierarchical approach is to partition a NAAN namespace into 1237 subnamespaces known as "shoulders". As explained in Section 2.4.1, 1238 each shoulder is a unique prefix that guarantees non-collision of 1239 names in different partitions. This practice is strongly encouraged 1240 for all NAAs, especially when subnamespace management and assignment 1241 streams will be delegated to departments, units, or projects within 1242 an organization. For example, with a NAAN that is assigned to a 1243 university and managed by its main library, the library should take 1244 care to reserve shoulders (semantically opaque shoulders being 1245 preferred) for distinct assignment streams. Prefix-based partition 1246 management is typically an important responsibility of the NAA. 1248 This shoulder delegation approach plays out differently in two real- 1249 world examples: DNS names and ISBN identifiers. In the former, the 1250 hierarchy is deliberately exposed and in the latter it is hidden. 1251 Rather than using lexical boundary markers such as the period (`.') 1252 found in domain names, the ISBN uses a publisher prefix but doesn't 1253 disclose where the prefix ends and the publisher's assigned name 1254 begins. 
This practice of non-disclosure, found in the ISBN and ISSN 1255 schemes, is encouraged in assigning ARKs because it reduces the 1256 visibility of an assertion that is probably not important now and may 1257 become a vulnerability later. 1259 If longevity is the goal, it is important to keep the prefixes free 1260 of recognizable semantics; for example, using an acronym representing 1261 a project or a department is discouraged. At the same time, you may 1262 wish to set aside a subnamespace for testing purposes under a 1263 shoulder such as "fk9..." that can serve as a visual clue and 1264 reminder to maintenance staff that this "fake" identifier was never 1265 published. 1267 There are other measures one can take to avoid user confusion, 1268 transcription errors, and the appearance of accidental semantics when 1269 creating identifiers. If you are generating identifiers 1270 automatically, pure numeric identifiers are likely to be 1271 semantically opaque enough, but it's probably useful to avoid leading 1272 zeroes because some users mistakenly treat them as optional, thinking 1273 (arithmetically) that they don't contribute to the "value" of the 1274 identifier. 1276 If you need lots of identifiers and you don't want them to get too 1277 long, you can mix digits with consonants (but avoid vowels since they 1278 might accidentally spell words) to get more identifiers without 1279 increasing the string length. In this case you may not want more 1280 than two letters in a row, because that limit reduces the chance of 1281 generating acronyms. Generator tools such as [NOID] provide support 1282 for these sorts of identifiers, and can also add a computed check 1283 character as a guarantee against the most common transcription 1284 errors. If used, it is recommended that the check character be 1285 appended to the original Base Compact Name string (i.e., minus the 1286 check character), that original string having been the basis for 1287 computing the check character.
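The minting advice above (digits mixed with consonants, at most two letters in a row, and a check character appended to the Base Compact Name) can be illustrated with a minimal sketch. It is only loosely modeled on what tools such as [NOID] do; the alphabet, the position-weighted checksum, and the function names (check_char, mint_blade, mint, is_valid) are assumptions made for this illustration, not something this document specifies.

```python
import random

# Assumed alphabet: digits plus consonants (no vowels), so that generated
# strings are unlikely to spell words. It has 29 symbols, a prime number.
DIGITS = "0123456789"
CONSONANTS = "bcdfghjkmnpqrstvwxz"
ALPHABET = DIGITS + CONSONANTS


def check_char(base: str) -> str:
    """Compute a check character over `base` (the name minus any check char).

    Assumed scheme: each character's index in ALPHABET is multiplied by its
    1-based position, and the sum modulo len(ALPHABET) picks the check
    character. Characters outside ALPHABET (such as '/') count as zero.
    """
    total = 0
    for position, ch in enumerate(base, start=1):
        total += max(ALPHABET.find(ch), 0) * position
    return ALPHABET[total % len(ALPHABET)]


def mint_blade(length: int, rng: random.Random) -> str:
    """Return `length` random characters with at most two letters in a row."""
    out = []
    letters_in_a_row = 0
    for _ in range(length):
        # After two consecutive letters, force the next character to be a digit.
        pool = DIGITS if letters_in_a_row >= 2 else ALPHABET
        ch = rng.choice(pool)
        letters_in_a_row = letters_in_a_row + 1 if ch in CONSONANTS else 0
        out.append(ch)
    return "".join(out)


def mint(naan: str, shoulder: str, length: int, rng: random.Random) -> str:
    """Mint NAAN/shoulder+blade and append the check character to it."""
    base = f"{naan}/{shoulder}{mint_blade(length, rng)}"
    return base + check_char(base)


def is_valid(name: str) -> bool:
    """Recompute the check character from all but the last character."""
    return len(name) > 1 and check_char(name[:-1]) == name[-1]
```

For example, mint("12345", "fk9", 8, random.Random()) would produce a test-shoulder name whose final character is_valid() can verify. With a prime-sized alphabet and position weights, this assumed checksum catches any single-character substitution and any swap of two adjacent, differing characters in names shorter than 29 characters.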
1289 3.7. Sub-Object Naming 1291 As mentioned previously, semantically opaque identifiers are very 1292 useful for long-term naming of abstract objects; however, it may be 1293 appropriate to extend these names with less opaque extensions that 1294 reference contemporary service entry points (sub-objects) in support 1295 of the object. Sub-object extensions beginning with a digit or 1296 underscore (`_') are reserved for the possibility of developing a 1297 future registry of canonical service points (e.g., numeric references 1298 to versions, formats, languages, etc). 1300 4. Finding a Name Mapping Authority 1302 In order to derive an actionable identifier (these days, a URL) from 1303 an ARK, a hostname (or hostname plus port combination) for a working 1304 Name Mapping Authority (NMA) must be found. An NMA is a service that 1305 is able to respond to basic ARK service requests. Relying on 1306 registration and client-side discovery, NMAs make known which NAAs' 1307 identifiers they are willing to service. 1309 Upon encountering an ARK, a user (or client software) looks inside it 1310 for the optional NMA part (usually the host part of the NMA's ARK 1311 service). If it contains an NMA that is working, this NMA discovery 1312 step may be skipped; the NMA effectively uses the beginning of an ARK 1313 to cache the results of a prior mapping authority discovery process. 1314 If a new NMA needs to be found, the client looks inside the ARK again 1315 for the NAAN (Name Assigning Authority Number). Querying a global 1316 database, it then uses the NAAN to look up all current NMAs that 1317 service ARKs issued by the identified NAA. 1319 The global database is key, and ideally the lookup would be automatic 1320 and transparent to the user. For this, the current most promising 1321 method is the Name-to-Thing (N2T) Resolver [N2T] at n2t.net.
It is a 1322 reliable, low-cost NMA supported by the ARK Alliance that primarily 1323 exists to support actionable HTTP-based URLs for as long as HTTP is 1324 used. One of its big advantages over the other two methods and the 1325 URN, Handle, DOI, and PURL methods, is that N2T addresses the 1326 namespace splitting problem. When objects maintained by one NMA are 1327 inherited by more than one successor NMA, until now one of those 1328 successors would be required to maintain forwarding tables on behalf 1329 of the other successors. 1331 There are two other ways to discover an NMA, one of them described in 1332 a subsection below. Another way, described in an appendix, is based 1333 on a simplification of the URN resolver discovery method, itself very 1334 similar in principle to the resolver discovery method used by Handles 1335 and DOIs. None of these methods does more than what can be done with 1336 a very small, consortially maintained web server such as [N2T]. 1338 In the interests of long-term persistence, however, ARK mechanisms 1339 are first defined in high-level, protocol-independent terms so that 1340 mechanisms may evolve and be replaced over time without compromising 1341 fundamental service objectives. Either or both specific methods 1342 given here may eventually be supplanted by better methods since, by 1343 design, the ARK scheme does not depend on a particular method, but 1344 only on having some method to locate an active NMA. 1346 At the time of issuance, at least one NMA for an ARK should be 1347 prepared to service it. That NMA may or may not be administered by 1348 the Name Assigning Authority (NAA) that created it. Consider the 1349 following hypothetical example of providing long-term access to a 1350 cancer research journal. The publisher wishes to turn a profit and 1351 the National Library of Medicine wishes to preserve the scholarly 1352 record. 
An agreement might be struck whereby the publisher would act 1353 as the NAA and the national library would archive the journal issue 1354 when it appears, but without providing direct access for the first 1355 six months. During the first six months of peak commercial 1356 viability, the publisher would retain exclusive delivery rights and 1357 would charge access fees. Again, by agreement, both the library and 1358 the publisher would act as NMAs, but during that initial period the 1359 library would redirect requests for issues less than six months old 1360 to the publisher. At the end of the waiting period, the library 1361 would then begin servicing requests for issues older than six months 1362 by tapping directly into its own archives. Meanwhile, the publisher 1363 might routinely redirect incoming requests for older issues to the 1364 library. Long-term access is thereby preserved, and so is the 1365 commercial incentive to publish content. 1367 Although it will be common for an NAA also to run an NMA service, it 1368 is never a requirement. Over time NAAs and NMAs will come and go. 1369 One NMA will succeed another, and there might be many NMAs serving 1370 the same ARKs simultaneously (e.g., as mirrors or as competitors). 1371 There might also be asymmetric but coordinated NMAs as in the 1372 library-publisher example above. 1374 4.1. Looking Up NMAs in a Globally Accessible File 1376 This subsection describes a way to look up NMAs using a simple name 1377 authority table represented as a plain text file. For efficient 1378 access the file may be stored in a local filesystem, but it needs to 1379 be reloaded periodically to incorporate updates. It is not expected 1380 that the size of the file or frequency of update should impose an 1381 undue maintenance or searching burden any time soon, for even 1382 primitive linear search of a file with ten-thousand NAAs is a 1383 subsecond operation on modern server machines. 
The proposed file 1384 strategy is similar to the /etc/hosts file strategy that supported 1385 Internet host address lookup for a period of years before the advent 1386 of DNS. 1388 The name authority table file is updated on an ongoing basis and is 1389 available for copying over the Internet from a number of mirror sites 1390 [NAANregistry]. The file contains comment lines (lines that begin 1391 with `#') explaining the format and giving the file's modification 1392 time, reloading address, and NAA registration instructions. 1394 5. Generic ARK Service Definition 1396 An ARK request's output is delivered information; examples include 1397 the object itself, a policy declaration (e.g., a promise of support), 1398 a descriptive metadata record, or an error message. The experience 1399 of object delivery is expected to be an evolving mix of information 1400 that reflects changing service expectations and technology 1401 requirements; contemporary examples include such things as an object 1402 summary and component links formatted for human consumption. ARK 1403 services must be couched in high-level, protocol-independent terms if 1404 persistence is to outlive today's networking infrastructural 1405 assumptions. The high-level ARK service definitions listed below are 1406 followed in the next section by a concrete method (one of many 1407 possible methods) for delivering these services with today's 1408 technology. Note that some services may be invoked in one operation, 1409 such as when an '?info' inflection returns both a description and a 1410 permanence declaration for an object. 1412 5.1. Generic ARK Access Service (access, location) 1414 Returns (a copy of) the object or a redirect to the same, although a 1415 sensible object proxy may be substituted. 
Examples of sensible 1416 substitutes include, 1418 * a table of contents instead of a large complex document, 1420 * a home page instead of an entire web site hierarchy, 1422 * a rights clearance challenge before accessing protected data, 1424 * directions for access to an offline object (e.g., a book), 1426 * a description of an intangible object (a disease, an event), or 1428 * an applet acting as "player" for a large multimedia object. 1430 May also return a discriminated list of alternate object locators. 1431 If access is denied, returns an explanation of the object's current 1432 (perhaps permanent) inaccessibility. 1434 5.1.1. Generic Policy Service (permanence, naming, etc.) 1436 Returns declarations of policy and support commitments for given 1437 ARKs. Declarations are returned in either a structured metadata 1438 format or a human readable text format; sometimes one format may 1439 serve both purposes. Policy subareas may be addressed in separate 1440 requests, but the following areas should be covered: object 1441 permanence, object naming, object fragment addressing, and 1442 operational service support. 1444 The permanence declaration for an object is a rating defined with 1445 respect to an identified permanence provider (guarantor), which will 1446 be the NMA. It may include the following aspects. 
1448 (a) "object availability" -- whether and how access to the object 1449 is supported (e.g., online 24x7, or offline only), 1451 (b) "identifier validity" -- under what conditions the identifier 1452 will be or has been re-assigned, 1454 (c) "content invariance" -- under what conditions the content of 1455 the object is subject to change, and 1457 (d) "change history" -- access to corrections, migrations, and 1458 revisions, whether through links to the changed objects themselves 1459 or through a document summarizing the change history 1461 One approach to persistence statements, conceived independently from 1462 ARKs, can be found at [PStatements], with ongoing work available at 1463 [ARKspecs]. An older approach to a permanence rating framework is 1464 given in [NLMPerm], which identified the following "permanence 1465 levels": 1467 Not Guaranteed: No commitment has been made to retain this 1468 resource. It could become unavailable at any time. Its 1469 identifier could be changed. 1471 Permanent: Dynamic Content: A commitment has been made to keep 1472 this resource permanently available. Its identifier will always 1473 provide access to the resource. Its content could be revised or 1474 replaced. 1476 Permanent: Stable Content: A commitment has been made to keep this 1477 resource permanently available. Its identifier will always 1478 provide access to the resource. Its content is subject only to 1479 minor corrections or additions. 1481 Permanent: Unchanging Content: A commitment has been made to keep 1482 this resource permanently available. Its identifier will always 1483 provide access to the resource. Its content will not change. 1485 Naming policy for an object includes an historical description of the 1486 NAA's (and its successor NAA's) policies regarding differentiation of 1487 objects. 
Since it is the NMA that responds to requests for policy 1488 statements, it is useful for the NMA to be able to produce or 1489 summarize these historical NAA documents. Naming policy may include 1490 the following aspects. 1492 (i) "similarity" -- (or "unity") the limit, defined by the NAA, to 1493 the level of dissimilarity beyond which two similar objects 1494 warrant separate identifiers but before which they share one 1495 single identifier, and 1497 (ii) "granularity" -- the limit, defined by the NAA, to the level 1498 of object subdivision beyond which sub-objects do not warrant 1499 separately assigned identifiers but before which sub-objects are 1500 assigned separate identifiers. 1502 Subnaming policy for an object describes the qualifiers that the NMA, 1503 in fulfilling its ongoing and evolving service obligations, allows as 1504 extensions to an NAA-assigned ARK. To the conceptual object that the 1505 NAA named with an ARK, the NMA may add component access points and 1506 derivatives (e.g., format migrations in aid of preservation) in order 1507 to provide both basic and value-added services. 1509 Addressing policy for an object includes a description of how, during 1510 access, object components (e.g., paragraphs, sections) or views 1511 (e.g., image conversions) may or may not be "addressed", in other 1512 words, how the NMA permits arguments or parameters to modify the 1513 object delivered as the result of an ARK request. If supported, 1514 these sorts of operations would provide things like byte-ranged 1515 fragment delivery and open-ended format conversions, or any set of 1516 possible transformations that would be too numerous to list or to 1517 identify with separately assigned ARKs. 1519 Operational service support policy includes a description of general 1520 operational aspects of the NMA service, such as after-hours staffing 1521 and trouble reporting procedures. 1523 5.1.2. 
Generic Description Service 1525 Returns a description of the object. Descriptions are returned in a 1526 structured metadata format, a human-readable text format, or in one 1527 format that serves both purposes (such as human-readable HTML with 1528 embedded machine-readable metadata, or perhaps YAML). A description 1529 must at a minimum answer the who, what, when, and where questions 1530 ("where" being the long-term identifier as opposed to a transient 1531 redirect target) concerning an expression of the object. Standalone 1532 descriptions should be accompanied by the modification date and 1533 source of the description itself. May also return discriminated 1534 lists of ARKs that are related to the given ARK. 1536 5.2. Overview of The HTTP URL Mapping Protocol (THUMP) 1538 The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (any 1539 identifier) and asking such questions as, what information does this 1540 identify and how permanent is it? [THUMP] is in fact one specific 1541 method under development for delivering ARK services. The protocol 1542 runs over HTTP to exploit the web browser's current pre-eminence as 1543 user interface to the Internet. THUMP is designed so that a person 1544 can enter ARK requests directly into the location field of current 1545 browser interfaces. Because it runs over HTTP, THUMP can be 1546 simulated and tested via keyboard-based interactions [RFC0854]. 1548 The asker (a person or client program) starts with an identifier, 1549 such as an ARK or a URL. The identifier reveals to the asker (or 1550 allows the asker to infer) the Internet host name and port number of 1551 a server system that responds to questions. Here, this is just the 1552 NMA that is obtained by inspection and possibly lookup based on the 1553 ARK's NAAN. 
The asker then sets up an HTTP session with the server 1554 system, sends a question via a THUMP request (contained within an 1555 HTTP request), receives an answer via a THUMP response (contained 1556 within an HTTP response), and closes the session. That concludes the 1557 connected portion of the protocol. 1559 A THUMP request is a string of characters beginning with a `?' 1560 (question mark) that is appended to the identifier string. The 1561 resulting string is sent as an argument to HTTP's GET command. 1562 Request strings too long for GET may be sent using HTTP's POST 1563 command. The two most common requests correspond to two degenerate 1564 special cases. First, a simple key with no request at all is the 1565 same as an ordinary access request. Thus a plain ARK entered into a 1566 browser's location field behaves much like a plain URL, and returns 1567 access to the primary identified object, for instance, an HTML 1568 document. 1570 The second special case is a minimal ARK description request string 1571 consisting of just "?info". For example, entering the string, 1573 n2t.net/ark:67531/metadc107835?info 1575 into the browser's location field directly precipitates a request for 1576 a metadata record describing the object identified by ark:67531/ 1577 metadc107835. The browser, unaware of THUMP, prepares and sends an 1578 HTTP GET request in the same manner as for a URL. THUMP is designed 1579 so that the response (indicated by the returned HTTP content type) is 1580 normally displayed, whether the output is structured for machine 1581 processing (text/plain) or formatted for human consumption (text/ 1582 html). In addition to '?info', this specification reserves both '?' 1583 and '??' (originally older forms) for future use. 1585 The following example THUMP session assumes metadata being returned 1586 by a resolver (as server) to a browser client. 
Each line has been 1587 annotated to include a line number and whether it was the client or 1588 server that sent it. Without going into much depth, the session has 1589 three pieces separated from each other by blank lines: the client's 1590 piece (lines 1-3), the server's HTTP/THUMP response headers (4-8), 1591 and the body of the server's response (9-18). The first and last 1592 lines (1 and 19) correspond to the client's steps to start the TCP 1593 session and the server's steps to end it, respectively.
1595     1  C: [opens session]
1596        C: GET https://n2t.net/ark:67531/metadc107835?info HTTP/1.1
1597        C:
1598        S: HTTP/1.1 200 OK
1599     5  S: Content-Type: text/plain
1600        S: THUMP-Status: 0.6 200 OK
1601        S: Link: rel="describes";
1602        S:
1603        S: erc:
1604    10  S: who: Austin, Larry
1605        S: what: A Study of Rhythm in Bach's Orgelbüchlein
1606        S: when: 1952
1607        S: where: https://digital.library.unt.edu/ark:/67531/metadc107835
1608        S: erc-support:
1609    15  S: who: University of North Texas Libraries
1610        S: what: Permanent: Stable Content:
1611        S: when: 20081203
1612        S: where: https://digital.library.unt.edu/ark:/67531/
1613        S: [closes session]
1615 The first two server response lines (4-5) above are typical of HTTP. 1616 The next line (6) is peculiar to THUMP, and indicates the THUMP 1617 version and a normal return status. The final header line (7) 1618 asserts, for the benefit of recipients unfamiliar with ARK 1619 inflections, that the response describes the uninflected ARK. 1621 The balance of the response consists of a single metadata record 1622 (9-18) that comprises the ARK description service response. The 1623 returned record is in the format of an Electronic Resource Citation 1624 [ERC], which is discussed in overview in the next section.
For now, 1625 note that it contains four elements that answer the top priority 1626 questions regarding an expression of the object: who played a major 1627 role in expressing it, what the expression was called, when it was 1628 created, and where the expression may be found (note that "where" is 1629 preferably a persistent, citable identifier rather than an unstable 1630 URL sometimes mistakenly referred to as a "location"). This quartet 1631 of elements comes up again and again in ERCs. Lines 14-18 contain a 1632 minimal persistence statement. 1634 Each segment in an ERC tells a different story relating to the 1635 object, so although the same four questions (elements) appear in 1636 each, the answers depend on the segment's story type. While the 1637 first segment tells the story of an expression of the object, the 1638 second segment tells the story of the support commitment made to it: 1639 who made the commitment, what the nature of the commitment was, when 1640 it was made, and where a fuller explanation of the commitment may be 1641 found. 1643 5.3. The Electronic Resource Citation (ERC) 1645 An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a 1646 kind of object description that uses Dublin Core Kernel metadata 1647 elements [DCKernel]. The ERC with Kernel elements provides a simple, 1648 compact, and printable record for holding data associated with an 1649 information resource. As originally designed [Kernel], Kernel 1650 metadata balances the needs for expressive power, very simple machine 1651 processing, and direct human manipulation.
The ERC sense of 1652 "citation" is not limited to the traditional referencing of a result 1653 or information fixed in time on a printed page, but to a more general 1654 kind of reference, both backward, to digital material that cannot be 1655 known to be fixed in time (true of virtually all online information), 1656 and forward, to material that is all the more valuable for improving 1657 or evolving over time. 1659 The previous section shows two limited examples of what is fully 1660 described elsewhere [ERC]. The rest of this short section provides 1661 some of the background and rationale for this record format. 1663 A founding principle of Kernel metadata is that direct human contact 1664 with metadata will be a necessary and sufficient condition for the 1665 near term rapid development of metadata standards, systems, and 1666 services. Thus the machine-processable Kernel elements must only 1667 minimally strain people's ability to read, understand, change, and 1668 transmit ERCs without their relying on intermediation with 1669 specialized software tools. The basic ERC needs to be succinct, 1670 transparent, and trivially parseable by software. 1672 Borrowing from the data structuring format that underlies the 1673 successful spread of email and web services, the ERC format uses 1674 [ANVL], which is based on email and HTTP headers [RFC2822]. There is 1675 a naturalness to ANVL's label-colon-value format (seen in the 1676 previous section) that barely needs explanation to a person beginning 1677 to enter ERC metadata. 1679 While ANVL elements are expected at the top level and don't 1680 themselves support hierarchy, the value of an ANVL element may be an 1681 arbitrary encoded hierarchy of JSON or XML. Typically, the name of 1682 such an ANVL element ends in "json" or "xml", for example, "json" or 1683 "geojson". 
Care should be taken to escape structural characters that 1684 appear in element names and values, specifically, line terminators 1685 (both newlines ("\n") and carriage returns ("\r")) and, in element 1686 names, colons (":"). 1688 Besides simplicity of ERC system implementation and data entry 1689 mechanics, ERC semantics (what the record and its constituent parts 1690 mean) must also be easy to explain. ERC semantics are based on a 1691 reformulation and extension of the Dublin Core [RFC5013] hypothesis, 1692 which suggests that the fifteen Dublin Core metadata elements have a 1693 key role to play in cross-domain resource description. The ERC 1694 design recognizes that the Dublin Core's primary contribution is the 1695 international, interdisciplinary consensus that identified fifteen 1696 semantic buckets (element categories), regardless of how they are 1697 labeled. The ERC then adds a definition for a record and some 1698 minimal compliance rules. In pursuing the limits of simplicity, the 1699 ERC design combines and relabels some Dublin Core buckets to isolate 1700 a tiny kernel (subset) of four elements for basic cross-domain 1701 resource description. 1703 For the cross-domain kernel, the ERC uses the four basic elements -- 1704 who, what, when, and where -- to pretend that every object in the 1705 universe can have a uniform minimal description. Each has a name or 1706 other identifier, a locator (a means to access it), some responsible 1707 person or party, and a date. It doesn't matter what type of object 1708 it is, or whether one plans to read it, interact with it, smoke it, 1709 wear it, or navigate it. Of course, this approach is flawed because 1710 uniformity of description for some object types requires more 1711 semantic contortion and sacrifice than for others. That is why at 1712 the beginning of this document, the ARK was said to be suited to 1713 objects that accommodate reasonably regular electronic description. 
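The label-colon-value layout and the four kernel elements described above can be illustrated with a minimal sketch that reads the two-segment record from the THUMP example in the previous section. This is not a full ANVL parser: continuation lines and character escaping are deliberately left out, and the function names (parse_erc, missing_kernel), along with the rule that a label with an empty value opens a new segment, are assumptions made for this illustration.

```python
KERNEL = ("who", "what", "when", "where")

# Record body copied from the example THUMP session in the previous section.
SAMPLE = """\
erc:
who: Austin, Larry
what: A Study of Rhythm in Bach's Orgelbüchlein
when: 1952
where: https://digital.library.unt.edu/ark:/67531/metadc107835
erc-support:
who: University of North Texas Libraries
what: Permanent: Stable Content:
when: 20081203
where: https://digital.library.unt.edu/ark:/67531/
"""


def parse_erc(text: str) -> dict:
    """Split label-colon-value lines into segments of elements.

    Assumption for this sketch: a label with an empty value (such as
    "erc:") opens a new segment, and later elements belong to it.
    """
    segments = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, since values may contain colons.
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a label-colon-value line: {raw!r}")
        label, value = label.strip(), value.strip()
        if not value:
            current = segments.setdefault(label, {})
        elif current is None:
            raise ValueError("element found before any segment label")
        else:
            current[label] = value
    return segments


def missing_kernel(segment: dict) -> list:
    """List the kernel elements (who, what, when, where) absent from a segment."""
    return [name for name in KERNEL if name not in segment]
```

Calling parse_erc(SAMPLE) yields one dictionary per segment ("erc" and "erc-support"), and missing_kernel() returns an empty list for each, reflecting that both stories answer the same four questions.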
1715 While insisting on uniformity at the most basic level provides 1716 powerful cross-domain leverage, the semantic sacrifice is great for 1717 many applications. So the ERC also permits a semantically rich and 1718 nuanced description to co-exist in a record along with a basic 1719 description. In that way both sophisticated and naive recipients of 1720 the record can extract the level of meaning from it that best suits 1721 their needs and abilities. Key to unlocking the richer description 1722 is a controlled vocabulary of ERC record types (not explained in this 1723 document) that permit knowledgeable recipients to apply defined sets 1724 of additional assumptions to the record. 1726 5.4. Advice to Web Clients 1728 ARKs are envisaged to appear wherever durable object references are 1729 planned. Library cataloging records, literature citations, and 1730 bibliographies are important examples. In many of these places URLs 1731 (Uniform Resource Locators) are currently used, and inside some of 1732 those URLs are embedded URNs, Handles, and DOIs. Unfortunately, 1733 there's no suggestion of a way to probe for extra services that would 1734 build confidence in those identifiers; in other words, there's no way 1735 to tell whether any of those identifiers is any better managed than 1736 the average URL. 1738 ARKs are also envisaged to appear in hypertext links (where they are 1739 not normally shown to users) and in rendered text (displayed or 1740 printed). A normal HTML link for which the URL is not displayed 1741 looks like this. 
1743 Click Here 1745 A URL with an embedded ARK invites access (via `?info') to extra 1746 services: 1748 Click Here 1750 Using the [N2T] resolver to provide identifier-scheme-agnostic 1751 protection against hostname instability, this ARK could be published 1752 as: 1754 Click Here 1755 An NAA will typically make known the associations it creates by 1756 publishing them in catalogs, actively advertising them, or simply 1757 leaving them on web sites for visitors (e.g., users, indexing 1758 spiders) to stumble across in browsing. 1760 5.5. Enhancements and Related Specifications 1762 ARK services, data models, inflections, and applications continue to 1763 evolve. Follow-on developments and specifications will be made 1764 available from the ARK Maintenance Agency [ARKspecs]. 1766 5.6. Security Considerations 1768 The ARK naming scheme poses no direct risk to computers and networks. 1769 Implementors of ARK services need to be aware of security issues when 1770 querying networks and filesystems for Name Mapping Authority 1771 services, and the concomitant risks from spoofing and obtaining 1772 incorrect information. These risks are no greater for ARK mapping 1773 authority discovery than for other kinds of service discovery. For 1774 example, recipients of ARKs with a specified host (NMA) should treat 1775 it like a URL and be aware that the identified ARK service may no 1776 longer be operational. 1778 Apart from mapping authority discovery, ARK clients and servers 1779 subject themselves to all the risks that accompany normal operation 1780 of the protocols underlying mapping services (e.g., HTTP, Z39.50). 1781 As a specialization of such protocols, an ARK service may limit 1782 exposure to the usual risks. Indeed, ARK services may enhance a kind 1783 of security by helping users identify long-term reliable references 1784 to information objects. 1786 6. References 1788 [ANVL] Kunze, J., Kahle, B., Masanes, J., and G.
Mohr, "A Name- 1789 Value Language", 2005, 1790 . 1792 [ARK] Kunze, J., "Towards Electronic Persistence Using ARK 1793 Identifiers", IWAW/ECDL Annual Workshop Proceedings 3rd, 1794 August 2003, . 1796 [ARKagency] 1797 ARK Alliance, "ARK Maintenance Agency", 2021, 1798 . 1800 [ARKspecs] ARK Alliance, "ARK Maintenance Agency Specifications", 1801 2021, . 1803 [DCKernel] Dublin Core Metadata Initiative, "Kernel Metadata Working Group", 1804 2001-2008, . 1806 [DOI] International DOI Foundation, "The Digital Object Identifier (DOI) 1807 System", February 2001, . 1809 [ERC] Kunze, J. and A. Turner, "Kernel Metadata and Electronic 1810 Resource Citations", October 2007, 1811 . 1813 [Handle] Lannom, L., "Handle System Overview", ICSTI Forum No. 30, 1814 April 1999, . 1816 [Kernel] Kunze, J., "A Metadata Kernel for Electronic Permanence", 1817 Journal of Digital Information Vol 2, Issue 2, 1818 ISSN 1368-7506, January 2002, 1819 . 1821 [N2T] ARK Alliance, "Name-to-Thing Resolver", August 2006, 1822 . 1824 [NAANregistry] 1825 ARKs.org, "NAAN Registry", 2019, 1826 . 1828 [NAANrequest] 1829 ARKs.org, "NAAN Request Form", 2018, 1830 . 1832 [NLMPerm] Byrnes, M., "Permanence Levels and the Archives for NLM's 1833 Permanent Web Documents", March 2005, 1834 . 1837 [NOID] Kunze, J., "Nice Opaque Identifiers", April 2006, 1838 . 1840 [PStatements] 1841 Kunze, J., "Persistence statements: describing digital 1842 stickiness", October 2016, 1843 . 1845 [PURL] Shafer, K., "Introduction to Persistent Uniform Resource 1846 Locators", 1996, 1847 . 1850 [shoulderrequest] 1851 ARKs.org, "Shoulder Request Form", 2021, 1852 . 1854 [RFC0854] Postel, J. and J. Reynolds, "Telnet Protocol 1855 Specification", STD 8, RFC 854, DOI 10.17487/RFC0854, May 1856 1983, . 1858 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 1859 STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987, 1860 . 1862 [RFC2141] Moats, R., "URN Syntax", RFC 2141, DOI 10.17487/RFC2141, 1863 May 1997, .
1865 [RFC2288] Lynch, C., Preston, C., and R. Daniel, "Using Existing 1866 Bibliographic Identifiers as Uniform Resource Names", 1867 RFC 2288, DOI 10.17487/RFC2288, February 1998, 1868 . 1870 [RFC2611] Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom, 1871 "URN Namespace Definition Mechanisms", BCP 33, RFC 2611, 1872 DOI 10.17487/RFC2611, June 1999, 1873 . 1875 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 1876 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 1877 Transfer Protocol -- HTTP/1.1", RFC 2616, 1878 DOI 10.17487/RFC2616, June 1999, 1879 . 1881 [RFC2822] Resnick, P., Ed., "Internet Message Format", RFC 2822, 1882 DOI 10.17487/RFC2822, April 2001, 1883 . 1885 [RFC2915] Mealling, M. and R. Daniel, "The Naming Authority Pointer 1886 (NAPTR) DNS Resource Record", RFC 2915, 1887 DOI 10.17487/RFC2915, September 2000, 1888 . 1890 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 1891 Resource Identifier (URI): Generic Syntax", STD 66, 1892 RFC 3986, DOI 10.17487/RFC3986, January 2005, 1893 . 1895 [RFC5013] Kunze, J. and T. Baker, "The Dublin Core Metadata Element 1896 Set", RFC 5013, DOI 10.17487/RFC5013, August 2007, 1897 . 1899 [THUMP] Gamiel, K. and J. Kunze, "The HTTP URL Mapping Protocol", 1900 August 2007, . 1903 Appendix A. ARK Maintenance Agency: arks.org 1905 The ARK Maintenance Agency [ARKagency] at arks.org has several 1906 functions. 1908 * To manage the registry of organizations that will be assigning 1909 ARKs. Organizations can request or update a NAAN by filling out 1910 the NAAN Request Form [NAANrequest]. 1912 * To be a clearinghouse for information about ARKs, such as best 1913 practices, introductory documentation, tutorials, community 1914 forums, etc. These supplemental resources help ARK implementors in 1915 high-level applications across different sectors and disciplines, 1916 and with a variety of metadata standards.
1918 * To be a locus of discussion about future versions of the ARK 1919 specification. 1921 Appendix B. Looking up NMAs Distributed via DNS 1923 This subsection introduces an older method for looking up NMAs that 1924 is based on the method for discovering URN resolvers described in 1925 [RFC2915]. It relies on querying the DNS system already installed in 1926 the background infrastructure of most networked computers. A query 1927 is submitted to DNS asking for a list of resolvers that match a given 1928 NAAN. DNS distributes the query to the particular DNS servers that 1929 can best provide the answer, unless the answer can be found more 1930 quickly in a local DNS cache as a side-effect of a recent query. 1931 Responses come back inside Name Authority Pointer (NAPTR) records. 1932 The normal result is one or more candidate NMAs. 1934 In its full generality the [RFC2915] algorithm ambitiously 1935 accommodates a complex set of preferences, orderings, protocols, 1936 mapping services, regular expression rewriting rules, and DNS record 1937 types. This subsection proposes a drastic simplification of it for 1938 the special case of ARK mapping authority discovery. The simplified 1939 algorithm is called Maptr. It uses only one DNS record type (NAPTR) 1940 and restricts most of its field values to constants. The following 1941 hypothetical excerpt from a DNS data file for the NAAN known as 12026 1942 shows three example NAPTR records ready to use with the Maptr 1943 algorithm. 1945 12026.ark.arpa. 1946 ;; US Library of Congress 1947 ;; order pref flags service regexp replacement 1948 IN NAPTR 0 0 "h" "ark" "USLC" lhc.nlm.nih.gov:8080 1949 IN NAPTR 0 0 "h" "ark" "USLC" foobar.zaf.org 1950 IN NAPTR 0 0 "h" "ark" "USLC" sneezy.dopey.com 1952 All the fields are held constant for Maptr except for the "flags", 1953 "regexp", and "replacement" fields. 
The "service" field contains the 1954 constant value "ark" so that NAPTR records participating in the Maptr 1955 algorithm will not be confused with other NAPTR records. The "order" 1956 and "pref" fields are held to 0 (zero) and otherwise ignored for now; 1957 the algorithm may evolve to use these fields for ranking decisions 1958 when usage patterns and local administrative needs are better 1959 understood. 1961 When a Maptr query returns a record with a flags field of "h" (for 1962 host, a Maptr extension to the NAPTR flags), the replacement field 1963 contains the NMA (host) of an ARK service provider. When a query 1964 returns a record with a flags field of "" (the empty string), the 1965 client needs to submit a new query containing the domain name found 1966 in the replacement field. This second sort of record exploits the 1967 distributed nature of DNS by redirecting the query to another domain 1968 name. It looks like this. 1970 12345.ark.arpa. 1971 ;; Digital Library Consortium 1972 ;; order pref flags service regexp replacement 1973 IN NAPTR 0 0 "" "ark" "" dlc.spct.org. 1975 Here is the Maptr algorithm for ARK mapping authority discovery. In 1976 it replace <NAAN> with the NAAN from the ARK for which an NMA is 1977 sought. 1979 1. Initialize the DNS query: type=NAPTR, query=<NAAN>.ark.arpa. 1981 2. Submit the query to DNS and retrieve (NAPTR) records, discarding 1982 any record that does not have "ark" for the service field. 1984 3. All remaining records with a flags field of "h" contain 1985 candidate NMAs in their replacement fields. Set them aside, if 1986 any. 1988 4. Any record with an empty flags field ("") has a replacement field 1989 containing a new domain name to which a subsequent query should 1990 be redirected. For each such record, set query=<replacement> 1991 then go to step (2). When all such records have been recursively 1992 exhausted, go to step (5). 1994 5.
All redirected queries have been resolved and a set of candidate 1995 NMAs has been accumulated from step (3). If there are zero 1996 NMAs, exit -- no mapping authority was found. If there are one or 1997 more NMAs, choose one using any criteria you wish, then exit. 1999 A Perl script that implements this algorithm is included here.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Net::DNS;                          # include simple DNS package

    my $qtype = "NAPTR";                   # initialize query type
    my $naa = shift;                       # get NAAN script argument
    die "usage: $0 NAAN\n" unless defined $naa;
    my $mad = Net::DNS::Resolver->new;     # mapping authority discovery
    my %seen;                              # domain names already queried

    maptr("$naa.ark.arpa");                # call maptr - that's it

    sub maptr {                            # recursive maptr algorithm
        my $dname = shift;                 # domain name as argument
        return if $seen{lc $dname}++;      # guard against redirect loops
        my $query = $mad->query($dname, $qtype);
        return if !$query;                 # non-productive query
        foreach my $rr ($query->answer) {
            next                           # skip records of wrong type
                if $rr->type ne $qtype;
            next                           # step 2: discard non-"ark" services
                if lc($rr->service) ne "ark";
            my $flags = lc $rr->flags;     # accessor returns the unquoted value
            if ($flags eq "") {
                maptr($rr->replacement);   # recurse
            } elsif ($flags eq "h") {
                print $rr->replacement, "\n";   # candidate NMA
            }
        }
    }

2030 The global database thus distributed via DNS and the Maptr algorithm 2031 can easily be seen to mirror the contents of the Name Authority 2032 Table file described in the previous section. 2034 Authors' Addresses 2036 John A. Kunze 2037 California Digital Library 2038 1111 Franklin Street 2039 Oakland, CA 94607 2040 United States of America 2041 Email: jak@ucop.edu 2043 Emmanuelle Bermès 2044 Bibliothèque nationale de France 2045 Quai François Mauriac 2046 75706 Paris 2047 France 2049 Email: emmanuelle.bermes@bnf.fr