idnits 2.17.1 draft-kunze-ark-22.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The abstract seems to contain references ([Qualifier]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1776 has weird spacing: '... regexp repla...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 22, 2019) is 1763 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- == Missing Reference: 'Qualifier' is mentioned on line 513, but not defined ** Obsolete normative reference: RFC 2141 (Obsoleted by RFC 8141) ** Obsolete normative reference: RFC 2611 (Obsoleted by RFC 3406) ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Obsolete normative reference: RFC 2822 (Obsoleted by RFC 5322) ** Obsolete normative reference: RFC 2915 (Obsoleted by RFC 3401, RFC 3402, RFC 3403, RFC 3404) Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Kunze 3 Internet-Draft California Digital Library 4 Intended status: Informational E. Bermes 5 Expires: December 24, 2019 Bibliotheque nationale de France 6 June 22, 2019 8 The ARK Identifier Scheme 9 draft-kunze-ark-22 11 Abstract 13 The ARK (Archival Resource Key) naming scheme is designed to 14 facilitate the high-quality and persistent identification of 15 information objects. A founding principle of the ARK is that 16 persistence is purely a matter of service and is neither inherent in 17 an object nor conferred on it by a particular naming syntax. The 18 best that an identifier can do is to lead users to the services that 19 support robust reference. The term ARK itself refers both to the 20 scheme and to any single identifier that conforms to it. 
An ARK has 21 five components: 23 [http://NMAH/]ark:[/]NAAN/Name[Qualifier] 25 an optional and mutable Name Mapping Authority Hostport (usually a 26 hostname), the "ark:" label, the Name Assigning Authority Number 27 (NAAN), the assigned Name, and an optional and possibly mutable 28 Qualifier supported by the NMA. The NAAN and Name together form the 29 immutable persistent identifier for the object independent of the URL 30 hostname. An ARK is a special kind of URL that connects users to 31 three things: the named object, its metadata, and the provider's 32 promise about its persistence. When entered into the location field 33 of a Web browser, the ARK leads the user to the named object. That 34 same ARK, inflected by appending a single question mark (`?'), 35 returns a brief metadata record that is both human- and machine- 36 readable. When the ARK is inflected by appending dual question marks 37 (`??'), the returned metadata contains a commitment statement from 38 the current provider. Tools exist for minting, binding, and 39 resolving ARKs. 41 Status of This Memo 43 This Internet-Draft is submitted in full conformance with the 44 provisions of BCP 78 and BCP 79. 46 Internet-Drafts are working documents of the Internet Engineering 47 Task Force (IETF). Note that other groups may also distribute 48 working documents as Internet-Drafts. The list of current Internet- 49 Drafts is at https://datatracker.ietf.org/drafts/current/. 51 Internet-Drafts are draft documents valid for a maximum of six months 52 and may be updated, replaced, or obsoleted by other documents at any 53 time. It is inappropriate to use Internet-Drafts as reference 54 material or to cite them other than as "work in progress." 56 This Internet-Draft will expire on December 24, 2019. 58 Copyright Notice 60 Copyright (c) 2019 IETF Trust and the persons identified as the 61 document authors. All rights reserved. 
63 This document is subject to BCP 78 and the IETF Trust's Legal 64 Provisions Relating to IETF Documents 65 (https://trustee.ietf.org/license-info) in effect on the date of 66 publication of this document. Please review these documents 67 carefully, as they describe your rights and restrictions with respect 68 to this document. 70 Table of Contents 72 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 73 1.1. Reasons to Use ARKs . . . . . . . . . . . . . . . . . . . 4 74 1.2. Three Requirements of ARKs . . . . . . . . . . . . . . . 5 75 1.3. Organizing Support for ARKs: Our Stuff vs. Their Stuff . 6 76 1.4. Definition of Identifier . . . . . . . . . . . . . . . . 7 77 2. ARK Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . 8 78 2.1. The Name Mapping Authority Hostport (NMAH) . . . . . . . 9 79 2.2. The ARK Label Part (ark:) . . . . . . . . . . . . . . . . 11 80 2.3. The Name Assigning Authority Number (NAAN) . . . . . . . 11 81 2.4. The Name Part . . . . . . . . . . . . . . . . . . . . . . 12 82 2.5. The Qualifier Part . . . . . . . . . . . . . . . . . . . 13 83 2.5.1. ARKs that Reveal Object Hierarchy . . . . . . . . . . 14 84 2.5.2. ARKs that Reveal Object Variants . . . . . . . . . . 15 85 2.6. Character Repertoires . . . . . . . . . . . . . . . . . . 16 86 2.7. Normalization and Lexical Equivalence . . . . . . . . . . 17 87 3. Naming Considerations . . . . . . . . . . . . . . . . . . . . 19 88 3.1. ARKS Embedded in Language . . . . . . . . . . . . . . . . 19 89 3.2. Objects Should Wear Their Identifiers . . . . . . . . . . 19 90 3.3. Names are Political, not Technological . . . . . . . . . 20 91 3.4. Choosing a Hostname or NMA . . . . . . . . . . . . . . . 20 92 3.5. Assigners of ARKs . . . . . . . . . . . . . . . . . . . . 22 93 3.6. NAAN Namespace Management . . . . . . . . . . . . . . . . 22 94 3.7. Sub-Object Naming . . . . . . . . . . . . . . . . . . . . 24 95 4. Finding a Name Mapping Authority . . . . . . . . . . . . . . 
24 96 4.1. Looking Up NMAHs in a Globally Accessible File . . . . . 25 97 5. Generic ARK Service Definition . . . . . . . . . . . . . . . 27 98 5.1. Generic ARK Access Service (access, location) . . . . . . 27 99 5.1.1. Generic Policy Service (permanence, naming, etc.) . . 27 100 5.1.2. Generic Description Service . . . . . . . . . . . . . 29 101 5.2. Overview of The HTTP URL Mapping Protocol (THUMP) . . . . 29 102 5.3. The Electronic Resource Citation (ERC) . . . . . . . . . 32 103 5.4. Advice to Web Clients . . . . . . . . . . . . . . . . . . 34 104 5.5. Security Considerations . . . . . . . . . . . . . . . . . 35 105 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 106 Appendix A. ARK Maintenance Agency: arks.org . . . . . . . . . . 37 107 Appendix B. Looking up NMAHs Distributed via DNS . . . . . . . . 38 108 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 40 110 1. Introduction 112 [ Note about this transitional draft. The ARKsInTheOpen.org 113 Technical Working Group (https://wiki.duraspace.org/display/ARKs/ 114 Technical+Working+Group) is in the process of revising the ARK spec 115 via a series of Internet-Drafts. No breaking changes from the 2008 116 spec are envisaged. Some minor changes are being deferred to later 117 in order to make it easier to review more important changes; some of 118 those small changes would result in "noisy diffs" since they are 119 global in scope, for example, converting all instances of http:// and 120 NMAH to https:// and NMA, respectively. ] 122 This document describes a scheme for the high-quality naming of 123 information resources. The scheme, called the Archival Resource Key 124 (ARK), is well suited to long-term access and identification of any 125 information resources that accommodate reasonably regular electronic 126 description. This includes digital documents, databases, software, 127 and websites, as well as physical objects (books, bones, statues, 128 etc.) 
and intangible objects (chemicals, diseases, vocabulary terms, 129 performances). Hereafter the term "object" refers to an information 130 resource. The term ARK itself refers both to the scheme and to any 131 single identifier that conforms to it. A reasonably concise and 132 accessible overview and rationale for the scheme is available at 133 [ARK]. 135 Schemes for persistent identification of network-accessible objects 136 are not new. In the early 1990's, the design of the Uniform Resource 137 Name [RFC2141] responded to the observed failure rate of URLs by 138 articulating an indirect, non-hostname-based naming scheme and the 139 need for responsible name management. Meanwhile, promoters of the 140 Digital Object Identifier [DOI] succeeded in building a community of 141 providers around a mature software system [Handle] that supports name 142 management. The Persistent Uniform Resource Locator [PURL] was 143 another scheme that had the advantage of working with unmodified web 144 browsers. ARKs represent an approach that attempts to build on the 145 strengths and to avoid the weaknesses of these schemes. 147 A founding principle of the ARK is that persistence is purely a 148 matter of service. Persistence is neither inherent in an object nor 149 conferred on it by a particular naming syntax. Nor is the technique 150 of name indirection -- upon which URNs, Handles, DOIs, and PURLs are 151 founded -- of central importance. Name indirection is an ancient and 152 well-understood practice; new mechanisms for it keep appearing and 153 distracting practitioner attention, with the Domain Name System (DNS) 154 [RFC1034] being a particularly dazzling and elegant example. What is 155 often forgotten is that maintenance of an indirection table is an 156 unavoidable cost to the organization providing persistence, and that 157 cost is equivalent across naming schemes. 
That indirection has 158 always been a native part of the web while being so lightly utilized 159 for the persistence of web-based objects indicates how unsuited most 160 organizations will probably be to the task of table maintenance and 161 to the much more fundamental challenge of keeping the objects 162 themselves viable. 164 Persistence is achieved through a provider's successful stewardship 165 of objects and their identifiers. The highest level of persistence 166 will be reinforced by a provider's robust contingency, redundancy, 167 and succession strategies. It is further safeguarded to the extent 168 that a provider's mission is shielded from funding and political 169 instabilities. These are by far the major challenges confronting 170 persistence providers, and no identifier scheme has any direct impact 171 on them. In fact, some schemes may actually be liabilities for 172 persistence because they create short- and long-term dependencies for 173 every object access on complex, special-purpose infrastructures, 174 parts of which are proprietary and all of which increase the carry- 175 forward burden for the preservation community. It is for this reason 176 that the ARK scheme relies only on educated name assignment and light 177 use of general-purpose infrastructures that are maintained mostly by 178 the internet community at large (the DNS, web servers, and web 179 browsers). 181 1.1. Reasons to Use ARKs 183 If no persistent identifier scheme contributes directly to 184 persistence, why not just use URLs? A particular URL may be as 185 durable an identifier as it is possible to have, but nothing 186 distinguishes it from an ordinary URL to the recipient who is 187 wondering if it is suitable for long-term reference. 
An ARK embedded
188     in a URL provides some of the necessary conditions for credible
189     persistence, inviting access to not one, but to three things: to the
190     object, to its metadata, and to a nuanced statement of commitment
191     from the provider in question (the NMA, described below) regarding
192     the object. Existence of the two extra services can be probed
193     automatically by appending `?' and `??' to the ARK.

195     The form of the ARK also supports the natural separation of naming
196     authorities into the original name assigning authority and the
197     diverse multiple name mapping (or servicing) authorities that in
198     succession and in parallel will take over custodial responsibilities
199     from the original assigner (assuming the assigner ever held that
200     responsibility) for the large majority of a long-term object's
201     archival lifetime. The name mapping authority, indicated by the
202     hostname part of the URL that contains the ARK, serves to launch the
203     ARK into cyberspace. Should it ever fail (and there is no reason why
204     a well-chosen hostname for a 100-year-old cultural memory institution
205     shouldn't last as long as the DNS), that host name is considered
206     disposable and replaceable. Again, the form of the ARK helps
207     because it defines exactly how to recover the core immutable object
208     identity, and simple algorithms (one based on the URN model) or even
209     by-hand internet query can be used for locating another mapping
210     authority.

212     There are tools to assist in generating ARKs and other identifiers,
213     such as [NOID] and "uuidgen", both of which rely for uniqueness on
214     human-maintained registries. This document also contains some
215     guidelines and considerations for managing namespaces and choosing
216     hostnames with persistence in mind.

218  1.2. Three Requirements of ARKs

220     The first requirement of an ARK is to give users a link from an
221     object to a promise of stewardship for it.
That promise is a multi- 222 faceted covenant that binds the word of an identified service 223 provider to a specific set of responsibilities. It is critical for 224 the promise to come from a current provider and almost irrelevant, 225 over a long period of time, what the original assigner's intentions 226 were. No one can tell if successful stewardship will take place 227 because no one can predict the future. Reasonable conjecture, 228 however, may be based on past performance. There must be a way to 229 tie a promise of persistence to a provider's demonstrated or 230 perceived ability -- its reputation -- in that arena. Provider 231 reputations would then rise and fall as promises are observed 232 variously to be kept and broken. This is perhaps the best way we 233 have for gauging the strength of any persistence promise. 235 The second requirement of an ARK is to give users a link from an 236 object to a description of it. The problem with a naked identifier 237 is that without a description real identification is incomplete. 238 Identifiers common today are relatively opaque, though some contain 239 ad hoc clues reflecting assertions that were briefly true, such as 240 where in a filesystem hierarchy an object lived during a short stay. 241 Possession of both an identifier and an object is some improvement, 242 but positive identification may still be uncertain since the object 243 itself might not include a matching identifier or might not carry 244 evidence obvious enough to reveal its identity without significant 245 research. In either case, what is called for is a record bearing 246 witness to the identifier's association with the object, as supported 247 by a recorded set of object characteristics. This descriptive record 248 is partly an identification "receipt" with which users and archivists 249 can verify an object's identity after brief inspection and a 250 plausible match with recorded characteristics such as title and size. 
252 The final requirement of an ARK is to give users a link to the object 253 itself (or to a copy) if at all possible. Persistent access is the 254 central duty of an ARK. Persistent identification plays a vital 255 supporting role but, strictly speaking, it can be construed as no 256 more than a record attesting to the original assignment of a never- 257 reassigned identifier. Object access may not be feasible for various 258 reasons, such as a transient service outage, a catastrophic loss, a 259 licensing agreement that keeps an archive "dark" for a period of 260 years, or when an object's own lack of tangible existence confuses 261 normal concepts of access (e.g., a vocabulary term might be 262 "accessed" through its definition). In such cases the ARK's 263 identification role assumes a much higher profile. But attempts to 264 simplify the persistence problem by decoupling access from 265 identification and concentrating exclusively on the latter are of 266 questionable utility. A perfect system for assigning forever unique 267 identifiers might be created, but if it did so without reducing 268 access failure rates, no one would be interested. The central issue 269 -- which may be summed up as the "HTTP 404 Not Found" problem -- 270 would not have been addressed. 272 1.3. Organizing Support for ARKs: Our Stuff vs. Their Stuff 274 An organization and the user community it serves can often be seen to 275 struggle with two different areas of persistent identification: the 276 Our Stuff problem and the Their Stuff problem. In the Our Stuff 277 problem, we in the organization want our own objects to acquire 278 persistent names. Since we possess or control these objects, our 279 organization tackles the Our Stuff problem directly. Whether or not 280 the objects are named by ARKs, our organization is the responsible 281 party, so it can plan for, maintain, and make commitments about the 282 objects. 
284     In the Their Stuff problem, we in the organization want others'
285     objects to acquire persistent names. These are objects that we do
286     not own or control, but some of which are critically important to us.
287     But because they are beyond our influence as far as support is
288     concerned, creating and maintaining persistent identifiers for Their
289     Stuff is not especially purposeful or feasible for us to engage in.
290     There is little that we can do about someone else's stuff except
291     encourage their uptake or adoption of persistence services.

293     Co-location of persistent access and identification services is
294     natural. Any organization that undertakes ongoing support of true
295     persistent identification (which includes description) is well-served
296     if it controls, owns, or otherwise has clear internal access to the
297     identified objects, and this gives it an advantage if it wishes also
298     to support persistent access to outsiders. Conversely, persistent
299     access to outsiders requires orderly internal collection management
300     procedures that include monitoring, acquisition, verification, and
301     change control over objects, which in turn requires object
302     identifiers persistent enough to support auditable record keeping
303     practices.

305     Although organizing ARK services under one roof thus tends to make
306     sense, object hosting can successfully be separated from name
307     mapping. An example is when a name mapping authority centrally
308     provides uniform resolution services via a protocol gateway on behalf
309     of organizations that host objects behind a variety of access
310     protocols. It is also reasonable to build value-added description
311     services that rely on the underlying services of a set of mapping
312     authorities.

314     Supporting ARKs is not for every organization.
By requiring 315 specific, revealed commitments to preservation, to object access, and 316 to description, the bar for providing ARK services is higher than for 317 some other identifier schemes. On the other hand, it would be hard 318 to grant credence to a persistence promise from an organization that 319 could not muster the minimum ARK services. Not that there isn't a 320 business model for an ARK-like, description-only service built on top 321 of another organization's full complement of ARK services. For 322 example, there might be competition at the description level for 323 abstracting and indexing a body of scientific literature archived in 324 a combination of open and fee-based repositories. The description- 325 only service would have no direct commitment to the objects, but 326 would act as an intermediary, forwarding commitment statements from 327 object hosting services to requestors. 329 1.4. Definition of Identifier 331 An identifier is not a string of character data -- an identifier is 332 an association between a string of data and an object. This 333 abstraction is necessary because without it a string is just data. 334 It's nonsense to talk about a string's breaking, or about its being 335 strong, maintained, and authentic. But as a representative of an 336 association, a string can do, metaphorically, the things that we 337 expect of it. 339 Without regard to whether an object is physical, digital, or 340 conceptual, to identify it is to claim an association between it and 341 a representative string, such as "Jane" or "ISBN 0596000278". What 342 gives a claim credibility is a set of verifiable assertions, or 343 metadata, about the object, such as age, height, title, or number of 344 pages. In other words, the association is made manifest by a record 345 (e.g., a cataloging or other metadata record) that vouches for it. 
347 In the complete absence of any testimony (metadata) regarding an 348 association, a would-be identifier string is a meaningless sequence 349 of characters. To keep an externally visible but otherwise internal 350 string from being perceived as an identifier by outsiders, for 351 example, it suffices for an organization not to disclose the nature 352 of its association. For our immediate purpose, actual existence of 353 an association record is more important than its authenticity or 354 verifiability, which are outside the scope of this specification. 356 It is a gift to the identification process if an object carries its 357 own name as an inseparable part of itself, such as an identifier 358 imprinted on the first page of a document or embedded in a data 359 structure element of a digital document header. In cases where the 360 object is large, unwieldy, or unavailable (such as when licensing 361 restrictions are in effect), a metadata record that includes the 362 identifier string will usually suffice. That record becomes a 363 conveniently manipulable object surrogate, acting as both an 364 association "receipt" and "declaration". 366 Note that our definition of identifier extends the one in use for 367 Uniform Resource Identifiers [RFC3986]. The present document still 368 sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for 369 the string part of an identifier, but the context should make the 370 meaning clear. 372 2. ARK Anatomy 374 An ARK is represented by a sequence of characters (a string) that 375 contains the label, "ark:", optionally preceded by the beginning part 376 of a URL. Here is a diagrammed example. 
378     ARK ANATOMY             Core Immutable Identity
379     ===========         _______________|_______________
380                        /                               \
381       Resolver Service  Base Object Name    Qualifier
382      _________|_______  ________|_______  ______|______
383     /                 \/                \/             \
384     http://example.org/ark:12025/654xz321/s3/f8.05v.tiff
385            \_________/ \__/\___/ \______/\____/\_______/
386                 |       |    |      |      |       |
387                 |     Label  |      |  Sub-parts Variants
388                 |            |      |
389      Name Mapping Authority  | Assigned Name
390          Hostport (NMAH)     |
391           Name Assigning Authority Number (NAAN)

393     The ARK syntax can be summarized,

395        [http://NMAH/]ark:[/]NAAN/Name[Qualifier]

397     where the NMAH, '/', and Qualifier parts are in brackets to indicate
398     that they are optional. The Base Object Name is the substring
399     comprising the "ark:" label, the NAAN and the assigned Name. The
400     Resolver Service is replaceable and makes the ARK actionable for a
401     period of time. Without the Resolver Service part, what remains is
402     the Core Immutable Identity (the "persistible") part of the ARK.

404  2.1. The Name Mapping Authority Hostport (NMAH)

406     Before the "ark:" label may appear an optional Name Mapping Authority
407     Hostport (NMAH) that is a temporary address where ARK service
408     requests may be sent. Preceded by a URI-type protocol designation
409     such as "https://", it specifies a Resolver Service. The NMAH itself
410     is an Internet hostname or hostport combination having the same
411     format and semantics as the hostport part of a URL. The most
412     important thing about the NMAH is that it is "identity inert" from
413     the point of view of object identification. In other words, ARKs
414     that differ only in the optional NMAH part identify the same object.
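
As an illustration only, and not as part of the specification, the
syntax summary and the "identity inert" role of the NMAH can be
sketched in Python. The function name split_ark, the regular
expression, and the choice to write the label without the optional
'/' are assumptions of this sketch; it does not validate the
characters used in the NAAN or Name.

```python
import re

# Illustrative pattern for [http://NMAH/]ark:[/]NAAN/Name[Qualifier].
# This regular expression is an assumption of the sketch, not a normative grammar.
ARK_RE = re.compile(
    r"^(?:https?://(?P<nmah>[^/]+)/)?"  # optional Resolver Service (scheme + NMAH)
    r"ark:/?"                           # the label, with or without the optional '/'
    r"(?P<naan>[^/]+)/"                 # NAAN: everything up to the next '/'
    r"(?P<rest>.+)$"                    # assigned Name plus any Qualifier
)


def split_ark(ark):
    """Return (nmah, core), where core is the ARK without its Resolver Service part."""
    match = ARK_RE.match(ark)
    if match is None:
        raise ValueError("not recognized as an ARK: %r" % (ark,))
    # Rebuild the remaining string with the label written without the '/'.
    core = "ark:%s/%s" % (match.group("naan"), match.group("rest"))
    return match.group("nmah"), core
```

Applied to the diagrammed example, split_ark returns the hostport
"example.org" together with "ark:12025/654xz321/s3/f8.05v.tiff".
The same ARK written with no hostport returns None and an identical
string, so a recipient comparing only the second value would treat
the two as naming the same object.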
415     Thus, for example, the following three ARKs are synonyms for just one
416     information object:

418        http://loc.gov/ark:12025/654xz321
419        http://rutgers.edu/ark:12025/654xz321
420        ark:12025/654xz321

422     Strictly speaking, in the realm of digital objects, these ARKs may
423     lead over time to somewhat different or diverging instances of the
424     originally named object. In an ideal world, divergence of persistent
425     objects is not desirable, but it is widely believed that digital
426     preservation efforts will inevitably lead to alterations in some
427     original objects (e.g., a format migration in order to preserve the
428     ability to display a document). If any of those objects are held
429     redundantly in more than one organization (a common preservation
430     strategy), chances are small that all holding organizations will
431     perform the same precise transformations and all maintain the same
432     object metadata. More significant divergence would be expected when
433     the holding organizations serve different audiences or compete with
434     each other.

436     The NMAH part makes an ARK into an actionable URL. As with many
437     internet parameters, it is helpful to approach the NMAH being liberal
438     in what you accept and conservative in what you propose. From the
439     recipient's point of view, the NMAH part should be treated as
440     temporary, disposable, and replaceable. From the NMA's point of
441     view, it should be chosen with the greatest concern for longevity. A
442     carefully chosen NMAH should be at least as permanent as the
443     providing organization's own hostname. In the case of a national or
444     university library, for example, there is no reason why the NMAH
445     should not be considerably more permanent than soft-funded proxy
446     hostnames such as hdl.handle.net, dx.doi.org, and purl.org.
In 447 general and over time, however, it is not unexpected for an NMAH 448 eventually to stop working and require replacement with the NMAH of a 449 currently active service provider. 451 This replacement relies on a mapping authority "resolver" discovery 452 process, of which two alternate methods are outlined in a later 453 section. The ARK, URN, Handle, and DOI schemes all use a resolver 454 discovery model that sooner or later requires matching the original 455 assigning authority with a current provider servicing that 456 authority's named objects; once found, the resolver at that provider 457 performs what amounts to a redirect to a place where the object is 458 currently held. All the schemes rely on the ongoing functionality of 459 currently mainstream technologies such as the Domain Name System 460 [RFC1034] and web browsers. The Handle and DOI schemes in addition 461 require that the Handle protocol layer and global server grid be 462 available at all times. 464 The practice of prepending "http://" and an NMAH to an ARK is a way 465 of creating an actionable identifier by a method that is itself 466 temporary. Assuming that infrastructure supporting [RFC2616] 467 information retrieval will no longer be available one day, ARKs will 468 then have to be converted into new kinds of actionable identifiers. 469 By that time, if ARKs see widespread use, web browsers would 470 presumably evolve to perform this (currently simple) transformation 471 automatically. 473 2.2. The ARK Label Part (ark:) 475 The label part distinguishes an ARK from an ordinary identifier. 476 There is a new form of the label, "ark:", and an old form, "ark:/", 477 both of which must be recognized in perpetuity. Implementations 478 should generate new ARKs in the new form (without the "/") and 479 resolvers must always treat received ARKs as equivalent if they 480 differ only in regard to new form versus old form labels. 
Thus these 481 two ARKs are equivalent: 483 ark:/12025/654xz321 484 ark:12025/654xz321 486 In a URL found in the wild, the label indicates that the URL stands a 487 reasonable chance of being an ARK. If the context warrants, 488 verification that it actually is an ARK can be done by testing it for 489 existence of the three ARK services. 491 Since nothing about an identifier syntax directly affects 492 persistence, the "ark:" label (like "urn:", "doi:", and "hdl:") 493 cannot tell you whether the identifier is persistent or whether the 494 object is available. It does tell you that the original Name 495 Assigning Authority (NAA) had some sort of hopes for it, but it 496 doesn't tell you whether that NAA is still in existence, or whether a 497 decade ago it ceased to have any responsibility for providing 498 persistence, or whether it ever had any responsibility beyond naming. 500 Only a current provider can say for certain what sort of commitment 501 it intends, and the ARK label suggests that you can query the NMAH 502 directly to find out exactly what kind of persistence is promised. 503 Even if what is promised is impersistence (i.e., a short-term 504 identifier), saying so is valuable information to the recipient. 505 Thus an ARK is a high-functioning identifier in the sense that it 506 provides access to the object, the metadata, and a commitment 507 statement, even if the commitment is explicitly very weak. 509 2.3. The Name Assigning Authority Number (NAAN) 511 Recalling that the general form of the ARK is, 513 [http://NMAH/]ark:[/]NAAN/Name[Qualifier] 515 the part of the ARK directly following the "ark:" (or older "ark:/") 516 label is the Name Assigning Authority Number (NAAN), up to but not 517 including the next `/' (slash) character. This part is always 518 required, as it identifies the organization that originally assigned 519 the Name of the object. 
It is used to discover a currently valid
520     NMAH and to provide top-level partitioning of the space of all ARKs.

522     An organization may request a NAAN from the ARK Maintenance Agency
523     [ARKagency] (described in Appendix A) by filling out the form at
524     [NAANrequest]. NAANs are opaque strings of one or more characters
525     drawn from this set,

527        0123456789bcdfghjkmnpqrstvwxz

529     which consists of digits and consonants, minus the letter 'l'.
530     Restricting NAANs to this set serves two goals. It reduces the
531     chances that words -- past, present, and future -- will appear in
532     NAANs and carry unintended semantics. It also helps usability by not
533     mixing commonly confused characters ('0' and 'O', '1' and 'l') and by
534     being compatible with strong transcription error detection (e.g., the
535     [NOID] check digit algorithm). Since 2001, every assigned NAAN has
536     consisted of exactly five digits, and no immediate change in that
537     practice is foreseen.

539     The NAAN designates a top-level ARK namespace. Once registered for a
540     namespace, a NAAN is never re-registered. It is possible, however,
541     for there to be a succession of organizations that manage an ARK
542     namespace.

544  2.4. The Name Part

546     The part of the ARK just after the NAAN is the Name assigned by the
547     NAA, and it is also required. Semantic opaqueness in the Name part
548     is strongly encouraged in order to reduce an ARK's vulnerability to
549     era- and language-specific change. Identifier strings containing
550     linguistic fragments can create support difficulties down the road.
551     No matter how appropriate or even meaningless they are today, such
552     fragments may one day create confusion, give offense, or infringe on
553     a trademark as the semantic environment around us and our communities
554     evolves.

556     Names that look more or less like numbers avoid common problems that
557     defeat persistence and international acceptance. The use of digits
558     is highly recommended.
Mixing in non-vowel alphabetic characters a 559 couple at a time is a relatively safe and easy way to achieve a 560 denser namespace (more possible names for a given length of the name 561 string). Such names have a chance of aging and traveling well. 562 Tools exist that mint, bind, and resolve opaque identifiers, with or 563 without check characters [NOID]. More on naming considerations is 564 given in a subsequent section. 566 2.5. The Qualifier Part 568 The part of the ARK following the NAA-assigned Name is an optional 569 Qualifier. It is a string that extends the base ARK in order to 570 create a kind of service entry point into the object named by the 571 NAA. At the discretion of the providing NMA, such a service entry 572 point permits an ARK to support access to individual hierarchical 573 components and subcomponents of an object, and to variants (versions, 574 languages, formats) of components. A Qualifier may be invented by 575 the NAA or by any NMA servicing the object. 577 In form, the Qualifier is a ComponentPath, or a VariantPath, or a 578 ComponentPath followed by a VariantPath. A VariantPath is introduced 579 and subdivided by the reserved character `.', and a ComponentPath is 580 introduced and subdivided by the reserved character `/'. In this 581 example, 583 http://example.org/ark:12025/654xz321/s3/f8.05v.tiff 585 the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is 586 a VariantPath. The ARK Qualifier is a formalization of some 587 currently mainstream URL syntax conventions. This formalization 588 specifically reserves meanings that permit recipients to make strong 589 inferences about logical sub-object containment and equivalence based 590 only on the form of the received identifiers; there is great 591 efficiency in not having to inspect metadata records to discover such 592 relationships. NMAs are free not to disclose any of these 593 relationships merely by avoiding the reserved characters above.
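As a minimal sketch of the ComponentPath/VariantPath split described above, the following Python fragment separates an ARK's Qualifier at its first period. The function name split_qualifier and the regular expression are illustrative only and are not part of the ARK specification. The sketch assumes that the Name ends at the first `/' or `.' after the NAAN and that the ARK carries no query string.

```python
import re

# Assumption (not stated normatively in the text): the Name ends at the
# first structural character ('/' or '.') that follows the NAAN, and
# everything after that point is the Qualifier.
ARK_RE = re.compile(
    r"ark:/?(?P<naan>[^/]+)/(?P<name>[^/.]+)(?P<qualifier>.*)$",
    re.IGNORECASE,
)


def split_qualifier(ark):
    """Return (naan, name, component_path, variant_path) for an ARK string.

    The ComponentPath is the part of the Qualifier before the first
    period; the VariantPath starts at that period. Either may be empty.
    """
    match = ARK_RE.search(ark)
    if match is None:
        raise ValueError("no ARK found in %r" % ark)
    qualifier = match.group("qualifier")
    dot = qualifier.find(".")
    if dot == -1:
        component_path, variant_path = qualifier, ""
    else:
        component_path, variant_path = qualifier[:dot], qualifier[dot:]
    return match.group("naan"), match.group("name"), component_path, variant_path


if __name__ == "__main__":
    print(split_qualifier("http://example.org/ark:12025/654xz321/s3/f8.05v.tiff"))
```

Applied to the example URL above, the sketch reports "/s3/f8" as the ComponentPath and ".05v.tiff" as the VariantPath. It performs no normalization and does not check for malformed Qualifiers.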
594 Hierarchical components and variants are discussed further in the 595 next two sections. 597 The Qualifier, if present, differs from the Name in several important 598 respects. First, a Qualifier may have been assigned either by the 599 NAA or later by the NMA. The assignment of a Qualifier by an NMA 600 effectively amounts to an act of publishing a service entry point 601 within the conceptual object originally named by the NAA. For our 602 purposes, an ARK extended with a Qualifier assigned by an NMA will be 603 called an NMA-qualified ARK. 605 Second, a Qualifier assignment on the part of an NMA is made in 606 fulfillment of its service obligations and may reflect changing 607 service expectations and technology requirements. NMA-qualified ARKs 608 could therefore be transient, even if the base, unqualified ARK is 609 persistent. For example, it would be reasonable for an NMA to 610 support access to an image object through an actionable ARK that is 611 considered persistent even if the experience of that access changes 612 as linking, labeling, and presentation conventions evolve and as 613 format and security standards are updated. For an image "thumbnail", 614 that NMA could also support an NMA-qualified ARK that is considered 615 impersistent because the thumbnail will be replaced with higher 616 resolution images as network bandwidth and CPU speeds increase. At 617 the same time, for an originally scanned, high-resolution master, the 618 NMA could publish an NMA-qualified ARK that is itself considered 619 persistent. Of course, the NMA must be able to return its separate 620 commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs, 621 and to any NAA-qualified ARKs that it supports. 623 A third difference between a Qualifier and a Name concerns the 624 semantic opaqueness constraint.
When an NMA-qualified ARK is to be 625 used as a transient service entry point into a persistent object, the 626 priority given to semantic opaqueness observed by the NAA in the Name 627 part may be relaxed by the NMA in the Qualifier part. If service 628 priorities in the Qualifier take precedence over persistence, short- 629 term usability considerations may recommend somewhat semantically 630 laden Qualifier strings. 632 Finally, not only is the set of Qualifiers supported by an NMA 633 mutable, but different NMAs may support different Qualifier sets for 634 the same NAA-identified object. In this regard the NMAs act 635 independently of each other and of the NAA. 637 The next two sections describe how ARK syntax may be used to declare, 638 or to avoid declaring, certain kinds of relatedness among qualified 639 ARKs. 641 2.5.1. ARKs that Reveal Object Hierarchy 643 An NAA or NMA may choose to reveal the presence of a hierarchical 644 relationship between objects using the `/' (slash) character after 645 the Name part of an ARK. Some authorities will choose not to 646 disclose this information, while others will go ahead and disclose so 647 that manipulators of large sets of ARKs can infer object 648 relationships by simple identifier inspection; for example, this 649 makes it possible for a system to present a collapsed view of a large 650 search result set. 652 If the ARK contains an internal slash after the NAAN, the piece to 653 its left indicates a containing object. For example, publishing an 654 ARK of the form, 656 ark:12025/654/xz/321 658 is equivalent to publishing three ARKs, 659 ark:12025/654/xz/321 660 ark:12025/654/xz 661 ark:12025/654 663 together with a declaration that the first object is contained in the 664 second object, and that the second object is contained in the third. 666 Revealing the presence of hierarchy is completely up to the assigner 667 (NMA or NAA). 
It is hard enough to commit to one object's name, let 668 alone to three objects' names and to a specific, ongoing relatedness 669 among them. Thus, regardless of whether hierarchy was present 670 initially, the assigner, by not using slashes, reveals no shared 671 inferences about hierarchical or other inter-relatedness in the 672 following ARKs: 674 ark:12025/654_xz_321 675 ark:12025/654_xz 676 ark:12025/654xz321 677 ark:12025/654xz 678 ark:12025/654 680 Note that slashes around the ARK's NAAN (/12025/ in these examples) 681 are not part of the ARK's Name and therefore do not indicate the 682 existence of some sort of NAAN super object containing all objects in 683 its namespace. A slash must have at least one non-structural 684 character (one that is neither a slash nor a period) on both sides in 685 order for it to separate recognizable structural components. So 686 initial or final slashes may be removed, and double slashes may be 687 converted into single slashes. 689 2.5.2. ARKs that Reveal Object Variants 691 An NAA or NMA may choose to reveal the possible presence of variant 692 objects or object components using the `.' (period) character after 693 the Name part of an ARK. Some authorities will choose not to 694 disclose this information, while others will go ahead and disclose so 695 that manipulators of large sets of ARKs can infer object 696 relationships by simple identifier inspection; for example, this 697 makes it possible for a system to present a collapsed view of a large 698 search result set. 700 If the ARK contains an internal period after Name, the piece to its 701 left is a root name and the piece to its right, and up to the end of 702 the ARK or to the next period is a suffix. A Name may have more than 703 one suffix, for example, 704 ark:12025/654.24 705 ark:12025/xz4/654.24 706 ark:12025/654.20v.78g.f55 708 There are two main rules. 
First, if two ARKs share the same root 709 name but have different suffixes, the corresponding objects were 710 considered variants of each other (different formats, languages, 711 versions, etc.) by the assigner (NMA or NAA). Thus, the following 712 ARKs are variants of each other: 714 ark:12025/654.20v.78g.f55 715 ark:12025/654.321xz 716 ark:12025/654.44 718 Second, publishing an ARK with a suffix implies the existence of at 719 least one variant identified by the ARK without its suffix. The ARK 720 otherwise permits no further assumptions about what variants might 721 exist. So publishing the ARK, 723 ark:12025/654.20v.78g.f55 725 is equivalent to publishing the four ARKs, 727 ark:12025/654.20v.78g.f55 728 ark:12025/654.20v.78g 729 ark:12025/654.20v 730 ark:12025/654 732 Revealing the possibility of variants is completely up to the 733 assigner. It is hard enough to commit to one object's name, let 734 alone to multiple variants' names and to a specific, ongoing 735 relatedness among them. The assigner is the sole arbiter of what 736 constitutes a variant within its namespace, and whether to reveal 737 that kind of relatedness by using periods within its names. 739 A period must have at least one non-structural character (one that is 740 neither a slash nor a period) on both sides in order for it to 741 separate recognizable structural components. So initial or final 742 periods may be removed, and adjacent periods may be converted into a 743 single period. Multiple suffixes should be arranged in sorted order 744 (pure ASCII collating sequence) at the end of an ARK. 746 2.6. Character Repertoires 748 The Name and Qualifier parts are strings of visible ASCII characters. 749 For received ARKs, implementations must support a minimum length of 750 255 octets for the string composed of the Base ARK plus Qualifier. 
751 Implementations generating strings exceeding this length should 752 understand that receiving implementations may not be able to index 753 such ARKs properly. Characters may be letters, digits, or any of 754 these seven characters: 756 = ~ * + @ _ $ 758 The following characters may also be used, but their meanings are 759 reserved: 761 % - . / 763 The characters `/' and `.' are ignored if either appears as the last 764 character of an ARK. If used internally, they allow a name assigner 765 to reveal object hierarchy and object variants as previously 766 described. 768 Hyphens are considered to be insignificant and are always ignored in 769 ARKs. A `-' (hyphen) may appear in an ARK for readability, or it may 770 have crept in during the formatting and wrapping of text, but it must 771 be ignored in lexical comparisons. As in a telephone number, hyphens 772 have no meaning in an ARK. It is always safe for an NMA that 773 receives an ARK to remove any hyphens found in it. As a result, like 774 the NMAH, hyphens are "identity inert" in comparing ARKs for 775 equivalence. For example, the following ARKs are equivalent for 776 purposes of comparison and ARK service access: 778 ark:12025/65-4-xz-321 779 http://sneezy.dopey.com/ark:12025/654--xz32-1 780 ark:12025/654xz321 782 The `%' character is reserved for %-encoding all other octets that 783 would appear in the ARK string, in the same manner as for URIs 784 [RFC3986]. A %-encoded octet consists of a `%' followed by two hex 785 digits; for example, "%7d" stands in for `}'. Lower case hex digits 786 are preferred to reduce the chances of false acronym recognition; 787 thus it is better to use "%acT" instead of "%ACT". The character `%' 788 itself must be represented using "%25". As with URNs, %-encoding 789 permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) 790 that have less restricted character repertoires [RFC2288]. 792 2.7. 
Normalization and Lexical Equivalence 794 To determine if two or more ARKs identify the same object, the ARKs 795 are compared for lexical equivalence after first being normalized. 796 Since ARK strings may appear in various forms (e.g., having different 797 NMAHs), normalizing them minimizes the chances that comparing two ARK 798 strings for equality will fail unless they actually identify 799 different objects. In a specified-host ARK (one having an NMAH), the 800 NMAH never participates in such comparisons. Normalization described 801 here serves to define lexical equivalence but does not restrict how 802 implementors normalize ARKs locally for storage. 804 Normalization of a received ARK for the purpose of octet-by-octet 805 equality comparison with another ARK consists of the following steps.
807 1. The NMAH part (e.g., everything from an initial "http://" up to the 808 next slash), if present, is removed.
810 2. Any URI query string is removed (everything from the first 811 literal '?' to the end of the string).
813 3. The first case-insensitive match on "ark:/" or "ark:" is 814 converted to "ark:" (replacing any upper case letters and 815 removing any terminal '/').
817 4. In the string that remains, the two characters following every 818 occurrence of `%' are converted to lower case. The case of all 819 other letters in the ARK string must be preserved.
821 5. All hyphens are removed.
823 6. If normalization is being done as part of a resolution step, and 824 if the end of the remaining string matches a known inflection, 825 the inflection is noted and removed.
827 7. Structural characters (slash and period) are normalized: initial 828 and final occurrences are removed, and two structural characters 829 in a row (e.g., // or ./) are replaced by the first character, 830 iterating until each occurrence has at least one non-structural 831 character on either side.
833 8. If there are any components with a period on the left and a slash 834 on the right, either the component and the preceding period must 835 be moved to the end of the Name part or the ARK must be thrown 836 out as malformed.
838 9. The final step is to arrange the suffixes in ASCII collating 839 sequence (that is, to sort them) and to remove duplicate 840 suffixes, if any. It is also permissible to throw out ARKs for 841 which the suffixes are not sorted.
843 The resulting ARK string is now normalized. Comparisons between 844 normalized ARKs are case-sensitive, meaning that upper case letters 845 are considered different from their lower case counterparts. 847 To keep ARK string variation to a minimum, no reserved ARK characters 848 should be %-encoded unless it is deliberately to conceal their 849 reserved meanings. No non-reserved ARK characters should ever be 850 %-encoded. Finally, no %-encoded character should ever appear in an 851 ARK in its decoded form. 853 3. Naming Considerations 855 The most important threats faced by persistence providers include 856 such things as funding loss, natural disaster, political and social 857 upheaval, processing faults, and errors in human oversight. There is 858 nothing that an identifier scheme can do about such things. Still, a 859 few observed identifier failures and inconveniences can be traced 860 back to naming practices that we now know to be less than optimal for 861 persistence. 863 3.1. ARKs Embedded in Language 865 The ARK has different goals from the URI, so it has different 866 character set requirements. Because linguistic constructs imperil 867 persistence, for ARKs non-ASCII character support is unimportant. 868 ARKs and URIs share goals of transcribability and transportability 869 within web documents, so characters are required to be visible, non- 870 conflicting with HTML/XML syntax, and not subject to tampering during 871 transmission across common transport gateways.
Add the goal of 872 making an undelimited ARK recognizable in running prose, as in 873 ark:12025/=@_22*$, and certain punctuation characters (e.g., comma, 874 period) end up being excluded from the ARK lest the end of a phrase 875 or sentence be mistaken for part of the ARK. 877 This consideration has more direct effect on ARK usability in a 878 natural language context than it has on ARK persistence. The same is 879 true of the rule preventing hyphens from having lexical significance. 880 It is fine to publish ARKs with hyphens in them (e.g., the 881 output of UUID/GUID generators), but the uniform treatment of hyphens 882 as insignificant reduces the possibility of users transcribing 883 identifiers that will have been broken through unpredictable 884 hyphenation by word processors. Any measure that reduces user 885 irritation with an identifier will increase its chances of survival. 887 3.2. Objects Should Wear Their Identifiers 889 A valuable technique for provision of persistent objects is to try to 890 arrange for the complete identifier to appear on, with, or near its 891 retrieved object. An object encountered at a moment in time when its 892 discovery context has long since disappeared could then easily be 893 traced back to its metadata, to alternate versions, to updates, etc. 894 This has seen reasonable success, for example, in book publishing and 895 software distribution. An identifier string only has meaning when 896 its association is known, and this is a very sure, simple, and low-tech 897 method of reminding everyone exactly what that association is. 899 3.3. Names are Political, not Technological 901 If persistence is the goal, a deliberate local strategy for 902 systematic name assignment is crucial. Names must be chosen with 903 great care. Poorly chosen and managed names will devastate any 904 persistence strategy, and they do not discriminate by identifier 905 scheme.
Whether a mistakenly re-assigned name is a URN, DOI, PURL, 906 URL, or ARK, the damage -- failed access and confusion -- is not 907 mitigated more in one scheme than in another. Conversely, in-house 908 efforts to manage names responsibly will go much further towards 909 safeguarding persistence than any choice of naming scheme or name 910 resolution technology. 912 Branding (e.g., at the corporate or departmental level) is important 913 for funding and visibility, but substrings representing brands and 914 organizational names should be given a wide berth except when 915 absolutely necessary in the hostname (the identity-inert) part of the 916 ARK. These substrings are not only unstable because organizations 917 change frequently, but they are also dangerous because successor 918 organizations often have political or legal reasons to actively 919 suppress predecessor names and brands. Any measure that reduces the 920 chances of future political or legal pressure on an identifier will 921 decrease the chances that our descendants will be obliged to 922 deliberately break it. 924 3.4. Choosing a Hostname or NMA 926 Hostnames appearing in any identifier meant to be persistent must be 927 chosen with extra care. The tendency in hostname selection has 928 traditionally been to choose a token with recognizable attributes, 929 such as a corporate brand, but that tendency wreaks havoc with 930 persistence that is supposed to outlive brands, corporations, subject 931 classifications, and natural language semantics (e.g., what did the 932 three letters "gay" mean in 1958, 1978, and 1998?). Today's 933 recognized and correct attributes are tomorrow's stale or incorrect 934 attributes. In making hostnames (any names, actually) long-term 935 persistent, it helps to eliminate recognizable attributes to the 936 extent possible. This affects selection of any name based on URLs, 937 including PURLs and the explicitly disposable NMAHs. 
939 There is no excuse for a provider that manages its internal names 940 impeccably not to exercise the same care in choosing what could be an 941 exceptionally durable hostname, especially if it would form the 942 prefix for all the provider's URL-based external names. Registering 943 an opaque hostname in the ".org" or ".net" domain would not be a bad 944 start. Another way is to publish your ARKs with an organizational 945 domain name that will be mapped by DNS to an appropriate NMA host. 946 This makes for shorter names with less branding vulnerability. 948 It is a mistake to think that hostnames are inherently unstable. If 949 you require brand visibility, that may be a fact of life. But things 950 are easier if yours is the brand of a long-lived cultural memory 951 institution such as a national or university library or archive. 952 Well-chosen hostnames from organizations that are sheltered from the 953 direct effects of a volatile marketplace can easily provide longer- 954 lived global resolvers than the domain names explicitly or implicitly 955 used as starting points for global resolution by indirection-based 956 persistent identifier schemes. For example, it is hard to imagine 957 circumstances under which the Library of Congress' domain name would 958 disappear sooner than, say, "handle.net". 960 For smaller libraries, archives, and preservation organizations, 961 there is a natural concern about whether they will be able to keep 962 their web servers and domain names in the face of uncertain funding. 963 One option is to form or join a consortium [N2T] of like-minded 964 organizations with the purpose of providing mutual preservation 965 support.
The first goal of such a consortium would be to perpetually 966 rent a hostname on which to establish a web server that simply 967 redirects incoming member organization requests to the appropriate 968 member server; using ARKs, for example, a 150-member consortium could 969 run a very small server (24x7) that contained nothing more than 150 970 rewrite rules in its configuration file. Even more helpful would be 971 additional consortial support for a member organization that was 972 unable to continue providing services and needed to find a successor 973 archival organization. This would be a low-cost, low-tech way to 974 publish ARKs (or URLs) under highly persistent hostnames. 976 There are no obvious reasons why the organizations registering DNS 977 names, URN Namespaces, and DOI publisher IDs should have among them 978 one that is intrinsically more fallible than the next. Moreover, it 979 is a misconception that the demise of DNS and of HTTP need adversely 980 affect the persistence of URLs. At such a time, certainly URLs from 981 the present day might not then be actionable by our present-day 982 mechanisms, but resolution systems for future non-actionable URLs are 983 no harder to imagine than resolution systems for present-day non- 984 actionable URNs and DOIs. There is no more stable a namespace than 985 one that is dead and frozen, and that would then characterize the 986 space of names bearing the "http://" prefix. It is useful to 987 remember that just because hostnames have been carelessly chosen in 988 their brief history does not mean that they are unsuitable in NMAHs 989 (and URLs) intended for use in situations demanding the highest level 990 of persistence available in the Internet environment. A well-planned 991 name assignment strategy is everything. 993 3.5. 
Assigners of ARKs 995 A Name Assigning Authority (NAA) is an organization that creates (or 996 delegates creation of) long-term associations between identifiers and 997 information objects. Examples of NAAs include national libraries, 998 national archives, and publishers. An NAA may arrange with an 999 external organization for identifier assignment. The US Library of 1000 Congress, for example, allows OCLC (the Online Computer Library 1001 Center, a major world cataloger of books) to create associations 1002 between Library of Congress Control Numbers (LCCNs) and the books that 1003 OCLC processes. A cataloging record is generated that testifies to 1004 each association, and the identifier is included by the publisher, 1005 for example, in the front matter of a book. 1007 An NAA does not so much create an identifier as create an 1008 association. The NAA first draws an unused identifier string from 1009 its namespace, which is the set of all identifiers under its control. 1010 It then records the assignment of the identifier to an information 1011 object having sundry witnessed characteristics, such as a particular 1012 author and modification date. A namespace is usually reserved for an 1013 NAA by agreement with recognized community organizations (such as 1014 IANA and ISO) that all names containing a particular string be under 1015 its control. In the ARK an NAA is represented by the Name Assigning 1016 Authority Number (NAAN). 1018 The ARK namespace reserved for an NAA is the set of names bearing its 1019 particular NAAN. For example, all strings beginning with 1020 "ark:12025/" are under control of the NAA registered under 12025, 1021 which might be the National Library of Finland. Because each NAA has 1022 a different NAAN, names from one namespace cannot conflict with those 1023 from another. Each NAA is free to assign names from its namespace 1024 (or delegate assignment) according to its own policies.
These 1025 policies must be documented in a manner similar to the declarations 1026 required for URN Namespace registration [RFC2611]. 1028 Organizations can request or update a NAAN by filling out a form 1029 [NAANrequest]. 1031 3.6. NAAN Namespace Management 1033 Every NAA must have a namespace management strategy. A time-honored 1034 technique is to hierarchically partition a namespace into 1035 subnamespaces using prefixes that guarantee non-collision of names in 1036 different partitions. This practice is strongly encouraged for all 1037 NAAs, especially when subnamespace management will be delegated to 1038 other departments, units, or projects within an organization. For 1039 example, with a NAAN that is assigned to a university and managed by 1040 its main library, care should be taken to reserve semantically opaque 1041 prefixes that will set aside large parts of the unused namespace for 1042 future assignments. Prefix-based partition management is an 1043 important responsibility of the NAA. 1045 This sort of delegation by prefix is widely used in the formation of 1046 DNS names and ISBN identifiers. An important difference is that in 1047 the former, the hierarchy is deliberately exposed and in the latter 1048 it is hidden. Rather than using lexical boundary markers such as the 1049 period (`.') found in domain names, the ISBN uses a publisher prefix 1050 but doesn't disclose where the prefix ends and the publisher's 1051 assigned name begins. This practice of non-disclosure, borrowed from 1052 the ISBN and ISSN schemes, is encouraged in assigning ARKs, because 1053 it reduces the visibility of an assertion that is probably not 1054 important now and may become a vulnerability later. 1056 Reasonable prefixes for assigned names usually consist of consonants 1057 and digits and are 1-5 characters in length.
For example, the 1058 constant prefix "x9t" might be delegated to a book digitization 1059 project that creates identifiers such as 1061 http://444.berkeley.edu/ark:28722/x9t38rk45c 1063 If longevity is the goal, it is important to keep the prefixes free 1064 of recognizable semantics; for example, using an acronym representing 1065 a project or a department is discouraged. At the same time, you may 1066 wish to set aside a subnamespace for testing purposes under a prefix 1067 such as "fk..." that can serve as a visual clue and reminder to 1068 maintenance staff that this "fake" identifier was never published. 1070 There are other measures one can take to avoid user confusion, 1071 transcription errors, and the appearance of accidental semantics when 1072 creating identifiers. If you are generating identifiers 1073 automatically, pure numeric identifiers are likely to be 1074 semantically opaque enough, but it's probably useful to avoid leading 1075 zeroes because some users mistakenly treat them as optional, thinking 1076 (arithmetically) that they don't contribute to the "value" of the 1077 identifier. 1079 If you need lots of identifiers and you don't want them to get too 1080 long, you can mix digits with consonants (but avoid vowels since they 1081 might accidentally spell words) to get more identifiers without 1082 increasing the string length. In this case you may not want more 1083 than two letters in a row, since that limit reduces the chance of 1084 generating acronyms. Generator tools such as [NOID] provide support 1085 for these sorts of identifiers, and can also add a computed check 1086 character as a guarantee against the most common transcription 1087 errors. 1089 3.7.
Sub-Object Naming 1091 As mentioned previously, semantically opaque identifiers are very 1092 useful for long-term naming of abstract objects; however, it may be 1093 appropriate to extend these names with less opaque extensions that 1094 reference contemporary service entry points (sub-objects) in support 1095 of the object. Sub-object extensions beginning with a digit or 1096 underscore (`_') are reserved for the possibility of developing a 1097 future registry of canonical service points (e.g., numeric references 1098 to versions, formats, languages, etc.). 1100 4. Finding a Name Mapping Authority 1102 In order to derive an actionable identifier (these days, a URL) from 1103 an ARK, a hostport (hostname or hostname plus port combination) for a 1104 working Name Mapping Authority (NMA) must be found. An NMA is a 1105 service that is able to respond to the three basic ARK service 1106 requests. Relying on registration and client-side discovery, NMAs 1107 make known which NAAs' identifiers they are willing to service. 1109 Upon encountering an ARK, a user (or client software) looks inside it 1110 for the optional NMAH part (the hostport of the NMA's ARK service). 1111 If it contains an NMAH that is working, this NMAH discovery step may 1112 be skipped; the NMAH effectively uses the beginning of an ARK to 1113 cache the results of a prior mapping authority discovery process. If 1114 a new NMAH needs to be found, the client looks inside the ARK again for 1115 the NAAN (Name Assigning Authority Number). Querying a global 1116 database, it then uses the NAAN to look up all current NMAHs that 1117 service ARKs issued by the identified NAA. 1119 The global database is key, and ideally the lookup would be automatic 1120 and transparent to the user. For this, the most promising method is 1121 probably the Name-to-Thing (N2T) Resolver [N2T] at n2t.net.
It is a 1122 proposed low-cost, highly reliable, consortially maintained NMAH that 1123 simply exists to support actionable HTTP-based URLs for as long as 1124 HTTP is used. One of its big advantages over the other two methods 1125 and the URN, Handle, DOI, and PURL methods, is that N2T addresses the 1126 namespace splitting problem. When objects maintained by one NMA are 1127 inherited by more than one successor NMA, until now one of those 1128 successors would be required to maintain forwarding tables on behalf 1129 of the other successors. 1131 There are two other ways to discover an NMAH, one of them described 1132 in a subsection below. Another way, described in an appendix, is 1133 based on a simplification of the URN resolver discovery method, 1134 itself very similar in principle to the resolver discovery method 1135 used by Handles and DOIs. None of these methods does more than what 1136 can be done with a very small, consortially maintained web server 1137 such as [N2T]. 1139 In the interests of long-term persistence, however, ARK mechanisms 1140 are first defined in high-level, protocol-independent terms so that 1141 mechanisms may evolve and be replaced over time without compromising 1142 fundamental service objectives. Either or both specific methods 1143 given here may eventually be supplanted by better methods since, by 1144 design, the ARK scheme does not depend on a particular method, but 1145 only on having some method to locate an active NMAH. 1147 At the time of issuance, at least one NMAH for an ARK should be 1148 prepared to service it. That NMA may or may not be administered by 1149 the Name Assigning Authority (NAA) that created it. Consider the 1150 following hypothetical example of providing long-term access to a 1151 cancer research journal. The publisher wishes to turn a profit and 1152 the National Library of Medicine wishes to preserve the scholarly 1153 record. 
An agreement might be struck whereby the publisher would act 1154 as the NAA and the national library would archive the journal issue 1155 when it appears, but without providing direct access for the first 1156 six months. During the first six months of peak commercial 1157 viability, the publisher would retain exclusive delivery rights and 1158 would charge access fees. Again, by agreement, both the library and 1159 the publisher would act as NMAs, but during that initial period the 1160 library would redirect requests for issues less than six months old 1161 to the publisher. At the end of the waiting period, the library 1162 would then begin servicing requests for issues older than six months 1163 by tapping directly into its own archives. Meanwhile, the publisher 1164 might routinely redirect incoming requests for older issues to the 1165 library. Long-term access is thereby preserved, and so is the 1166 commercial incentive to publish content. 1168 Although it will be common for an NAA also to run an NMA service, it 1169 is never a requirement. Over time NAAs and NMAs will come and go. 1170 One NMA will succeed another, and there might be many NMAs serving 1171 the same ARKs simultaneously (e.g., as mirrors or as competitors). 1172 There might also be asymmetric but coordinated NMAs as in the 1173 library-publisher example above. 1175 4.1. Looking Up NMAHs in a Globally Accessible File 1177 This subsection describes a way to look up NMAHs using a simple name 1178 authority table represented as a plain text file. For efficient 1179 access the file may be stored in a local filesystem, but it needs to 1180 be reloaded periodically to incorporate updates. It is not expected 1181 that the size of the file or frequency of update should impose an 1182 undue maintenance or searching burden any time soon, for even 1183 primitive linear search of a file with ten-thousand NAAs is a 1184 subsecond operation on modern server machines. 
   The proposed file
   strategy is similar to the /etc/hosts file strategy that supported
   Internet host address lookup for a period of years before the advent
   of DNS.

   The name authority table file is updated on an ongoing basis and is
   available for copying over the internet from a number of mirror sites
   [NAANregistry]. The file contains comment lines (lines that begin
   with `#') explaining the format and giving the file's modification
   time, reloading address, and NAA registration instructions. There is
   even a Perl script that processes the file embedded in the file's
   comments. The currently registered Name Assigning Authorities are:

      12025  National Library of Medicine
      12026  Library of Congress
      12027  National Agriculture Library
      13030  California Digital Library
      13038  World Intellectual Property Organization
      20775  University of California San Diego
      29114  University of California San Francisco
      28722  University of California Berkeley
      21198  University of California Los Angeles
      15230  Rutgers University
      13960  Internet Archive
      64269  Digital Curation Centre
      62624  New York University
      67531  University of North Texas
      27927  Ithaka Electronic-Archiving Initiative
      12148  Bibliotheque nationale de France
             / National Library of France
      78319  Google
      88435  Princeton University
      78428  University of Washington
      89901  Archives of the Region of Vaestra Goetaland
             and City of Gothenburg, Sweden
      80444  Northwest Digital Archives
      25593  Emory University
      25031  University of Kansas
      17101  Centre for Ecology & Hydrology, UK
      65323  University of Calgary
      61001  University of Chicago
      52327  Bibliotheque et Archives Nationales du Quebec
             / National Library and Archives of Quebec
      39331  National Szechenyi Library / National Library of Hungary
      26677  Library and Archives Canada / Bibliotheque et Archives Canada

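   The table lookup described in this subsection can be sketched in a
   few lines of Python. This is only an illustration under stated
   assumptions: the record layout (a NAAN followed by whitespace and one
   or more NMAHs), the example.org hostnames, and the function names
   load_table, naan_of, and lookup_nmahs are all hypothetical. The real
   file explains its own format in its comment lines.

```python
import io

# Illustrative table contents, NOT real registry data.  The
# "NAAN  NMAH [NMAH ...]" layout is an assumption made for this sketch.
SAMPLE_TABLE = """\
# name authority table (illustrative excerpt)
12025   ark.nlm.nih.gov
13030   ark.example.org ark-mirror.example.org
"""

def load_table(lines):
    """Return {NAAN: [NMAH, ...]}, skipping blank lines and '#' comments."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        naan, *nmahs = line.split()
        table.setdefault(naan, []).extend(nmahs)
    return table

def naan_of(ark):
    """Extract the NAAN from an ARK such as 'ark:12025/psbbantu'."""
    label, sep, rest = ark.partition(":")
    if label.lower() != "ark" or not sep:
        raise ValueError("not an ARK: %r" % ark)
    # Accept both 'ark:12025/...' and 'ark:/12025/...'.
    return rest.lstrip("/").split("/", 1)[0]

def lookup_nmahs(ark, table):
    """Return the candidate NMAHs for an ARK, or an empty list."""
    return table.get(naan_of(ark), [])

table = load_table(io.StringIO(SAMPLE_TABLE))
print(lookup_nmahs("ark:12025/psbbantu", table))
```

   The sketch loads the file into a dictionary once rather than scanning
   it linearly on every request. Either approach is cheap at the table
   sizes discussed above, and a periodic reload would simply call
   load_table again on a fresh copy of the file.
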
5. Generic ARK Service Definition

   An ARK request's output is delivered information; examples include
   the object itself, a policy declaration (e.g., a promise of support),
   a descriptive metadata record, or an error message. The experience
   of object delivery is expected to be an evolving mix of information
   that reflects changing service expectations and technology
   requirements; contemporary examples include such things as an object
   summary and component links formatted for human consumption. ARK
   services must be couched in high-level, protocol-independent terms if
   persistence is to outlive today's networking infrastructural
   assumptions. The high-level ARK service definitions listed below are
   followed in the next section by a concrete method (one of many
   possible methods) for delivering these services with today's
   technology.

5.1. Generic ARK Access Service (access, location)

   Returns (a copy of) the object or a redirect to the same, although a
   sensible object proxy may be substituted. Examples of sensible
   substitutes include,

   o  a table of contents instead of a large complex document,

   o  a home page instead of an entire web site hierarchy,

   o  a rights clearance challenge before accessing protected data,

   o  directions for access to an offline object (e.g., a book),

   o  a description of an intangible object (a disease, an event), or

   o  an applet acting as "player" for a large multimedia object.

   May also return a discriminated list of alternate object locators.
   If access is denied, returns an explanation of the object's current
   (perhaps permanent) inaccessibility.

5.1.1. Generic Policy Service (permanence, naming, etc.)

   Returns declarations of policy and support commitments for given
   ARKs.
   Declarations are returned in either a structured metadata
   format or a human readable text format; sometimes one format may
   serve both purposes. Policy subareas may be addressed in separate
   requests, but the following areas should be covered: object
   permanence, object naming, object fragment addressing, and
   operational service support.

   The permanence declaration for an object is a rating defined with
   respect to an identified permanence provider (guarantor), which will
   be the NMA. It may include the following aspects.

   (a)  "object availability" -- whether and how access to the object
        is supported (e.g., online 24x7, or offline only),

   (b)  "identifier validity" -- under what conditions the identifier
        will be or has been re-assigned,

   (c)  "content invariance" -- under what conditions the content of
        the object is subject to change, and

   (d)  "change history" -- access to corrections, migrations, and
        revisions, whether through links to the changed objects
        themselves or through a document summarizing the change history.

   A recent approach to persistence statements, conceived independently
   from ARKs, can be found at [PStatements], with ongoing work available
   at Appendix A. An older approach to a permanence rating framework is
   given in [NLMPerm], which identified the following "permanence
   levels":

      Not Guaranteed: No commitment has been made to retain this
      resource. It could become unavailable at any time. Its
      identifier could be changed.

      Permanent: Dynamic Content: A commitment has been made to keep
      this resource permanently available. Its identifier will always
      provide access to the resource. Its content could be revised or
      replaced.

      Permanent: Stable Content: A commitment has been made to keep this
      resource permanently available. Its identifier will always
      provide access to the resource.
      Its content is subject only to
      minor corrections or additions.

      Permanent: Unchanging Content: A commitment has been made to keep
      this resource permanently available. Its identifier will always
      provide access to the resource. Its content will not change.

   Naming policy for an object includes an historical description of the
   NAA's (and its successor NAA's) policies regarding differentiation of
   objects. Since it is the NMA that responds to requests for policy
   statements, it is useful for the NMA to be able to produce or
   summarize these historical NAA documents. Naming policy may include
   the following aspects.

   (i)  "similarity" -- (or "unity") the limit, defined by the NAA, to
        the level of dissimilarity beyond which two similar objects
        warrant separate identifiers but before which they share one
        single identifier, and

   (ii) "granularity" -- the limit, defined by the NAA, to the level
        of object subdivision beyond which sub-objects do not warrant
        separately assigned identifiers but before which sub-objects are
        assigned separate identifiers.

   Subnaming policy for an object describes the qualifiers that the NMA,
   in fulfilling its ongoing and evolving service obligations, allows as
   extensions to an NAA-assigned ARK. To the conceptual object that the
   NAA named with an ARK, the NMA may add component access points and
   derivatives (e.g., format migrations in aid of preservation) in order
   to provide both basic and value-added services.

   Addressing policy for an object includes a description of how, during
   access, object components (e.g., paragraphs, sections) or views
   (e.g., image conversions) may or may not be "addressed", in other
   words, how the NMA permits arguments or parameters to modify the
   object delivered as the result of an ARK request.
   If supported,
   these sorts of operations would provide things like byte-ranged
   fragment delivery and open-ended format conversions, or any set of
   possible transformations that would be too numerous to list or to
   identify with separately assigned ARKs.

   Operational service support policy includes a description of general
   operational aspects of the NMA service, such as after-hours staffing
   and trouble reporting procedures.

5.1.2. Generic Description Service

   Returns a description of the object. Descriptions are returned in a
   structured metadata format, human readable text format, or in one
   format that serves both purposes (such as human-readable HTML with
   embedded machine-readable metadata). A description must at a minimum
   answer the who, what, when, and where questions concerning an
   expression of the object. Standalone descriptions should be
   accompanied by the modification date and source of the description
   itself. May also return discriminated lists of ARKs that are related
   to the given ARK.

5.2. Overview of The HTTP URL Mapping Protocol (THUMP)

   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (any
   identifier) and asking such questions as, what information does this
   identify and how permanent is it? [THUMP] is in fact one specific
   method under development for delivering ARK services. The protocol
   runs over HTTP to exploit the web browser's current pre-eminence as
   user interface to the Internet. THUMP is designed so that a person
   can enter ARK requests directly into the location field of current
   browser interfaces. Because it runs over HTTP, THUMP can be
   simulated and tested via keyboard-based interactions [RFC0854].

   The asker (a person or client program) starts with an identifier,
   such as an ARK or a URL.
   The identifier reveals to the asker (or
   allows the asker to infer) the Internet host name and port number of
   a server system that responds to questions. Here, this is just the
   NMAH that is obtained by inspection and possibly lookup based on the
   ARK's NAAN. The asker then sets up an HTTP session with the server
   system, sends a question via a THUMP request (contained within an
   HTTP request), receives an answer via a THUMP response (contained
   within an HTTP response), and closes the session. That concludes the
   connected portion of the protocol.

   A THUMP request is a string of characters beginning with a `?'
   (question mark) that is appended to the identifier string. The
   resulting string is sent as an argument to HTTP's GET command.
   Request strings too long for GET may be sent using HTTP's POST
   command. The three most common requests correspond to three
   degenerate special cases that keep the user's learning and typing
   burden low. First, a simple key with no request at all is the same
   as an ordinary access request. Thus a plain ARK entered into a
   browser's location field behaves much like a plain URL, and returns
   access to the primary identified object, for instance, an HTML
   document.

   The second special case is a minimal ARK description request string
   consisting of just "?". For example, entering the string,

      ark.nlm.nih.gov/12025/psbbantu?

   into the browser's location field directly precipitates a request for
   a metadata record describing the object identified by
   ark:12025/psbbantu. The browser, unaware of THUMP, prepares and sends
   an HTTP GET request in the same manner as for a URL. THUMP is
   designed so that the response (indicated by the returned HTTP content
   type) is normally displayed, whether the output is structured for
   machine processing (text/plain) or formatted for human consumption
   (text/html).
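
   As a rough illustration of how the request string is appended to the
   identifier, the following Python sketch builds the URL and the HTTP
   request line for each of the three degenerate cases (no request
   string, "?", and the "??" form covered later in this section). The
   function and constant names are hypothetical, and the sketch only
   assembles strings; it does not open a connection.

```python
# Request strings for the three degenerate special cases.
ACCESS = ""        # plain key: ordinary access request
DESCRIBE = "?"     # minimal description request
POLICY = "??"      # minimal permanence policy request

def thump_url(nmah, ark, request=ACCESS):
    """Return the URL to hand to HTTP GET for a THUMP request.

    nmah    -- host[:port] of the Name Mapping Authority
    ark     -- the identifier, e.g. "ark:12025/psbbantu"
    request -- "" or a string beginning with "?"
    """
    if request and not request.startswith("?"):
        raise ValueError("a THUMP request string must begin with '?'")
    return "http://" + nmah + "/" + ark + request

def request_line(url):
    """Return the first line of an HTTP/1.1 GET request for the URL."""
    return "GET " + url + " HTTP/1.1"

for req in (ACCESS, DESCRIBE, POLICY):
    print(request_line(thump_url("ark.nlm.nih.gov", "ark:12025/psbbantu", req)))
```

   Keeping the request string as a plain suffix mirrors the way a person
   would type it into a browser's location field.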

   In the following example THUMP session, each line has been annotated
   to include a line number and whether it was the client or server that
   sent it. Without going into much depth, the session has three pieces
   separated from each other by blank lines: the client's piece (lines
   1-3), the server's HTTP/THUMP response headers (4-7), and the body of
   the server's response (8-13). The first and last lines (1 and 13)
   correspond to the client's steps to start the TCP session and the
   server's steps to end it, respectively.

       1  C: [opens session]
          C: GET http://ark.nlm.nih.gov/ark:12025/psbbantu? HTTP/1.1
          C:
          S: HTTP/1.1 200 OK
       5  S: Content-Type: text/plain
          S: THUMP-Status: 0.6 200 OK
          S:
          S: erc:
          S: who:    Lederberg, Joshua
      10  S: what:   Studies of Human Families for Genetic Linkage
          S: when:   1974
          S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
          S: [closes session]

   The first two server response lines (4-5) above are typical of HTTP.
   The next line (6) is peculiar to THUMP, and indicates the THUMP
   version and a normal return status.

   The balance of the response consists of a single metadata record
   (8-12) that comprises the ARK description service response. The
   returned record is in the format of an Electronic Resource Citation
   [ERC], which is discussed in overview in the next section. For now,
   note that it contains four elements that answer the top priority
   questions regarding an expression of the object: who played a major
   role in expressing it, what the expression was called, when it was
   created, and where the expression may be found. This quartet of
   elements comes up again and again in ERCs.

   The third degenerate special case of an ARK request (and no other
   cases will be described in this document) is the string "??",
   corresponding to a minimal permanence policy request.
   It can be seen
   in use appended to an ARK (on line 2) in the example session that
   follows.

       1  C: [opens session]
          C: GET http://ark.nlm.nih.gov/ark:12025/psbbantu?? HTTP/1.1
          C:
          S: HTTP/1.1 200 OK
       5  S: Content-Type: text/plain
          S: THUMP-Status: 0.6 200 OK
          S:
          S: erc:
          S: who:    Lederberg, Joshua
      10  S: what:   Studies of Human Families for Genetic Linkage
          S: when:   1974
          S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
          S: erc-support:
          S: who:    USNLM
      15  S: what:   Permanent, Unchanging Content
          S: when:   20010421
          S: where:  http://ark.nlm.nih.gov/yy22948
          S: [closes session]

   Each segment in an ERC tells a different story relating to the
   object, so although the same four questions (elements) appear in
   each, the answers depend on the segment's story type. While the
   first segment tells the story of an expression of the object, the
   second segment tells the story of the support commitment made to it:
   who made the commitment, what the nature of the commitment was, when
   it was made, and where a fuller explanation of the commitment may be
   found.

5.3. The Electronic Resource Citation (ERC)

   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
   kind of object description that uses Dublin Core Kernel metadata
   elements [DCKernel]. The ERC with Kernel elements provides a simple,
   compact, and printable record for holding data associated with an
   information resource. As originally designed [Kernel], Kernel
   metadata balances the needs for expressive power, very simple machine
   processing, and direct human manipulation.

   The previous section shows two limited examples of what is fully
   described elsewhere [ERC]. The rest of this short section provides
   some of the background and rationale for this record format.
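
   To give a feel for how little machinery the label-colon-value records
   in the two sessions above require, here is a minimal Python sketch
   that splits such a response body into segments. It is not a complete
   reader for the format defined in [ERC] and [ANVL]: it assumes that a
   label with an empty value (such as "erc:") opens a new segment, and
   it ignores continuation lines. The function name parse_erc is
   hypothetical.

```python
# Response body copied from the second example session above.
ERC_BODY = """\
erc:
who:    Lederberg, Joshua
what:   Studies of Human Families for Genetic Linkage
when:   1974
where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
erc-support:
who:    USNLM
what:   Permanent, Unchanging Content
when:   20010421
where:  http://ark.nlm.nih.gov/yy22948
"""

def parse_erc(text):
    """Return a list of (segment_label, {element: value}) pairs.

    Assumption for this sketch: a label with an empty value starts a
    new segment; every other line adds an element to the current one.
    """
    segments = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blanks and comments
        label, sep, value = line.partition(":")   # split at first colon
        if not sep:
            continue                      # continuation lines not handled
        label, value = label.strip(), value.strip()
        if not value:
            segments.append((label, {}))  # e.g. "erc:" or "erc-support:"
        elif segments:
            segments[-1][1][label] = value
    return segments

for name, elements in parse_erc(ERC_BODY):
    print(name, elements["who"], elements["when"])
```

   Splitting at the first colon only is what lets a value such as a URL
   keep its own colons intact.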

   A founding principle of Kernel metadata is that direct human contact
   with metadata will be a necessary and sufficient condition for the
   near-term rapid development of metadata standards, systems, and
   services. Thus the machine-processable Kernel elements must only
   minimally strain people's ability to read, understand, change, and
   transmit ERCs without their relying on intermediation with
   specialized software tools. The basic ERC needs to be succinct,
   transparent, and trivially parseable by software.

   In the current Internet, it is natural to give serious consideration
   to XML as an exchange format because of predictions that it will
   obviate many ad hoc formats and programs, and unify much of the
   world's information under one reliable data structuring discipline
   that is easy to generate, verify, parse, and render. It appears,
   however, that XML is still only catching on after years of standards
   work and implementation experience. The reasons for this are
   unclear, but for now very simple XML interpretation is still out of
   reach. Another important caution is that XML structures are hard on
   the eyeballs, taking up an amount of display (and page) space that
   significantly exceeds that of traditional formats. Until these
   conflicts with ERC principle are resolved, XML is not a first choice
   for representing ERCs. Borrowing instead from the data structuring
   format that underlies the successful spread of email and web
   services, the first ERC format uses [ANVL], which is based on email
   and HTTP headers [RFC2822]. There is a naturalness to ANVL's
   label-colon-value format (seen in the previous section) that barely
   needs explanation to a person beginning to enter ERC metadata.

   Besides simplicity of ERC system implementation and data entry
   mechanics, ERC semantics (what the record and its constituent parts
   mean) must also be easy to explain.
   ERC semantics are based on a
   reformulation and extension of the Dublin Core [RFC5013] hypothesis,
   which suggests that the fifteen Dublin Core metadata elements have a
   key role to play in cross-domain resource description. The ERC
   design recognizes that the Dublin Core's primary contribution is the
   international, interdisciplinary consensus that identified fifteen
   semantic buckets (element categories), regardless of how they are
   labeled. The ERC then adds a definition for a record and some
   minimal compliance rules. In pursuing the limits of simplicity, the
   ERC design combines and relabels some Dublin Core buckets to isolate
   a tiny kernel (subset) of four elements for basic cross-domain
   resource description.

   For the cross-domain kernel, the ERC uses the four basic elements --
   who, what, when, and where -- to pretend that every object in the
   universe can have a uniform minimal description. Each has a name or
   other identifier, a location, some responsible person or party, and a
   date. It doesn't matter what type of object it is, or whether one
   plans to read it, interact with it, smoke it, wear it, or navigate
   it. Of course, this approach is flawed because uniformity of
   description for some object types requires more semantic contortion
   and sacrifice than for others. That is why at the beginning of this
   document, the ARK was said to be suited to objects that accommodate
   reasonably regular electronic description.

   While insisting on uniformity at the most basic level provides
   powerful cross-domain leverage, the semantic sacrifice is great for
   many applications. So the ERC also permits a semantically rich and
   nuanced description to co-exist in a record along with a basic
   description. In that way both sophisticated and naive recipients of
   the record can extract the level of meaning from it that best suits
   their needs and abilities.
   Key to unlocking the richer description
   is a controlled vocabulary of ERC record types (not explained in this
   document) that permit knowledgeable recipients to apply defined sets
   of additional assumptions to the record.

5.4. Advice to Web Clients

   ARKs are envisaged to appear wherever durable object references are
   planned. Library cataloging records, literature citations, and
   bibliographies are important examples. In many of these places URLs
   (Uniform Resource Locators) are currently used, and inside some of
   those URLs are embedded URNs, Handles, and DOIs. Unfortunately,
   there's no suggestion of a way to probe for extra services that would
   build confidence in those identifiers; in other words, there's no way
   to tell whether any of those identifiers is any better managed than
   the average URL.

   ARKs are also envisaged to appear in hypertext links (where they are
   not normally shown to users) and in rendered text (displayed or
   printed). A normal HTML link for which the URL is not displayed
   looks like this.

      Click Here

   A URL with an embedded ARK invites access (via `?' and `??') to extra
   services:

      Click Here

   Using the [N2T] resolver to provide identifier-scheme-agnostic
   protection against hostname instability, this ARK could be published
   as:

      Click Here

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

5.5. Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information.
   These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery. For
   example, recipients of ARKs with a specified hostport (NMAH) should
   treat it like a URL and be aware that the identified ARK service may
   no longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As a specialization of such protocols, an ARK service may limit
   exposure to the usual risks. Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

6. References

   [ANVL]     Kunze, J. and B. Kahle, "A Name-Value Language", 2008.

   [ARK]      Kunze, J., "Towards Electronic Persistence Using ARK
              Identifiers", IWAW/ECDL Annual Workshop Proceedings 3rd,
              August 2003.

   [ARKagency]
              ARKs-in-the-Open, "ARK Maintenance Agency", 2019.

   [DCKernel]
              Initiative, D. C. M., "Kernel Metadata Working Group",
              2001-2008.

   [DOI]      Foundation, I. D., "The Digital Object Identifier (DOI)
              System", February 2001.

   [ERC]      Kunze, J. and A. Turner, "Kernel Metadata and Electronic
              Resource Citations", October 2007.

   [Handle]   Lannom, L., "Handle System Overview", ICSTI Forum No. 30,
              April 1999.

   [Kernel]   Kunze, J., "A Metadata Kernel for Electronic Permanence",
              Journal of Digital Information Vol 2, Issue 2,
              ISSN 1368-7506, January 2002.

   [N2T]      Library, C. D., "Name-to-Thing Resolver", August 2006.

   [NAANregistry]
              ARKs.org, "NAAN Registry", 2019.

   [NAANrequest]
              ARKs.org, "NAAN Request Form", 2018.

   [NLMPerm]  Byrnes, M., "Defining NLM's Commitment to the Permanence
              of Electronic Information", ARL 212:8-9, October 2000.

   [NOID]     Kunze, J., "Nice Opaque Identifiers", February 2005.

   [PStatements]
              Kunze, J., "Persistence statements: describing digital
              stickiness", October 2016.

   [PURL]     Shafer, K., "Introduction to Persistent Uniform Resource
              Locators", 1996.

   [RFC0854]  Postel, J. and J. Reynolds, "Telnet Protocol
              Specification", STD 8, RFC 854, DOI 10.17487/RFC0854, May
              1983.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, DOI 10.17487/RFC2141,
              May 1997.

   [RFC2288]  Lynch, C., Preston, C., and R. Daniel, "Using Existing
              Bibliographic Identifiers as Uniform Resource Names",
              RFC 2288, DOI 10.17487/RFC2288, February 1998.

   [RFC2611]  Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom,
              "URN Namespace Definition Mechanisms", BCP 33, RFC 2611,
              DOI 10.17487/RFC2611, June 1999.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616,
              DOI 10.17487/RFC2616, June 1999.

   [RFC2822]  Resnick, P., Ed., "Internet Message Format", RFC 2822,
              DOI 10.17487/RFC2822, April 2001.

   [RFC2915]  Mealling, M. and R. Daniel, "The Naming Authority Pointer
              (NAPTR) DNS Resource Record", RFC 2915,
              DOI 10.17487/RFC2915, September 2000.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC5013]  Kunze, J. and T. Baker, "The Dublin Core Metadata Element
              Set", RFC 5013, DOI 10.17487/RFC5013, August 2007.

   [THUMP]    Gamiel, K. and J. Kunze, "The HTTP URL Mapping Protocol",
              August 2007.

Appendix A. ARK Maintenance Agency: arks.org

   The ARK Maintenance Agency [ARKagency] at arks.org has several
   functions.

   o  To manage the registry of organizations that will be assigning
      ARKs. Organizations can request or update a NAAN by filling out a
      form [NAANrequest].

   o  To be a clearinghouse for information about ARKs, such as best
      practices, introductory documentation, tutorials, community
      forums, etc. These supplemental resources help ARK implementors
      in high-level applications across different sectors and
      disciplines, and with a variety of metadata standards.

   o  To be a locus of discussion about future versions of the ARK
      specification.

Appendix B. Looking up NMAHs Distributed via DNS

   This subsection introduces an older method for looking up NMAHs that
   is based on the method for discovering URN resolvers described in
   [RFC2915]. It relies on querying the DNS system already installed in
   the background infrastructure of most networked computers. A query
   is submitted to DNS asking for a list of resolvers that match a given
   NAAN. DNS distributes the query to the particular DNS servers that
   can best provide the answer, unless the answer can be found more
   quickly in a local DNS cache as a side-effect of a recent query.
   Responses come back inside Name Authority Pointer (NAPTR) records.
   The normal result is one or more candidate NMAHs.

   In its full generality the [RFC2915] algorithm ambitiously
   accommodates a complex set of preferences, orderings, protocols,
   mapping services, regular expression rewriting rules, and DNS record
   types. This subsection proposes a drastic simplification of it for
   the special case of ARK mapping authority discovery. The simplified
   algorithm is called Maptr. It uses only one DNS record type (NAPTR)
   and restricts most of its field values to constants.
   The following
   hypothetical excerpt from a DNS data file for the NAAN known as 12026
   shows three example NAPTR records ready to use with the Maptr
   algorithm.

      12026.ark.arpa.
      ;; US Library of Congress
      ;;       order pref flags service regexp replacement
      IN NAPTR  0     0    "h"   "ark"   "USLC"  lhc.nlm.nih.gov:8080
      IN NAPTR  0     0    "h"   "ark"   "USLC"  foobar.zaf.org
      IN NAPTR  0     0    "h"   "ark"   "USLC"  sneezy.dopey.com

   All the fields are held constant for Maptr except for the "flags",
   "regexp", and "replacement" fields. The "service" field contains the
   constant value "ark" so that NAPTR records participating in the Maptr
   algorithm will not be confused with other NAPTR records. The "order"
   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
   the algorithm may evolve to use these fields for ranking decisions
   when usage patterns and local administrative needs are better
   understood.

   When a Maptr query returns a record with a flags field of "h" (for
   hostport, a Maptr extension to the NAPTR flags), the replacement
   field contains the NMAH (hostport) of an ARK service provider. When
   a query returns a record with a flags field of "" (the empty string),
   the client needs to submit a new query containing the domain name
   found in the replacement field. This second sort of record exploits
   the distributed nature of DNS by redirecting the query to another
   domain name. It looks like this.

      12345.ark.arpa.
      ;; Digital Library Consortium
      ;;       order pref flags service regexp replacement
      IN NAPTR  0     0    ""    "ark"   ""      dlc.spct.org.

   Here is the Maptr algorithm for ARK mapping authority discovery. In
   it, NAAN stands for the NAAN from the ARK for which an NMAH is
   sought.

   1.  Initialize the DNS query: type=NAPTR, query=NAAN.ark.arpa.

   2.  Submit the query to DNS and retrieve (NAPTR) records, discarding
       any record that does not have "ark" for the service field.

   3.  All remaining records with a flags field of "h" contain
       candidate NMAHs in their replacement fields. Set them aside, if
       any.

   4.  Any record with an empty flags field ("") has a replacement field
       containing a new domain name to which a subsequent query should
       be redirected. For each such record, set the query to the domain
       name in its replacement field, then go to step (2). When all
       such records have been recursively exhausted, go to step (5).

   5.  All redirected queries have been resolved and a set of candidate
       NMAHs has been accumulated from step (3). If there are zero
       NMAHs, exit -- no mapping authority was found. If there is one
       or more NMAH, choose one using any criteria you wish, then exit.

   A Perl script that implements this algorithm is included here.

   #!/usr/bin/perl

   use strict;
   use warnings;
   use Net::DNS;                           # include simple DNS package

   my $qtype = "NAPTR";                    # initialize query type
   my $naa = shift                         # get NAAN script argument
       or die "usage: $0 NAAN\n";
   my $mad = Net::DNS::Resolver->new;      # mapping authority discovery

   maptr("$naa.ark.arpa");                 # call maptr - that's it

   sub maptr {                             # recursive maptr algorithm
       my $dname = shift;                  # domain name as argument
       my $query = $mad->query($dname, $qtype);
       return                              # non-productive query
           if (! $query || ! $query->answer);
       foreach my $rr ($query->answer) {
           next                            # skip records of wrong type
               if ($rr->type ne $qtype);
           next                            # step 2: keep "ark" service only
               if (lc($rr->service) ne "ark");
           if ($rr->flags eq "") {         # use the NAPTR field accessors
               maptr($rr->replacement);    # recurse
           } elsif (lc($rr->flags) eq "h") {
               print $rr->replacement, "\n";   # candidate NMAH
           }
       }
   }

   The global database thus distributed via DNS and the Maptr algorithm
   can easily be seen to mirror the contents of the Name Authority
   Table file described in Section 4.1.

Authors' Addresses

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA 94612
   USA

   Email: jak@ucop.edu


   Emmanuelle Bermes
   Bibliotheque nationale de France
   Quai Francois Mauriac
   Paris, Cedex 13 75706
   France

   Email: emmanuelle.bermes@bnf.fr