idnits 2.17.1 draft-kunze-ark-06.txt: -(62): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(270): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(333): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(380): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(400): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(422): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(424): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(430): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(432): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(770): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(804): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(808): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(985): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1040): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1169): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1193): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1364): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1366): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1367): Line appears to be too long, but this could be caused by non-ascii 
characters in UTF-8 encoding -(1377): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1379): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1382): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1406): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1407): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1408): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1432): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1448): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1465): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1475): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1526): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1631): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1637): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1640): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding -(1641): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. 
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document is more than 15 pages and seems to lack a Table of Contents. == There are 53 instances of lines with non-ascii characters in the document. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 40 longer pages, the longest (page 2) being 63 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 40 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 11 instances of too long lines in the document, the longest one being 7 characters in excess of 72. == There are 15 instances of lines with non-RFC2606-compliant FQDNs in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 555 has weird spacing: '...eful to remem...' == Line 759 has weird spacing: '... regexp repla...' == Line 1829 has weird spacing: '...for the purpo...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. 
If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (31 July 2003) is 7574 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'MD5' is defined on line 1707, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE' -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI' ** Obsolete normative reference: RFC 822 (ref. 'EMHDRS') (Obsoleted by RFC 2822) -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC' -- Possible downref: Non-RFC (?) normative reference: ref. 'HKMP' ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. 'MD5') ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC 3401, RFC 3402, RFC 3403, RFC 3404) -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm' -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL' -- Possible downref: Non-RFC (?) normative reference: ref. 'REG' ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC 3986) ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref. 'URNBIB') ** Obsolete normative reference: RFC 2141 (ref. 
'URNSYN') (Obsoleted by RFC 8141) ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC 3406) Summary: 14 errors (**), 0 flaws (~~), 10 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet-Draft: draft-kunze-ark-06.txt J. Kunze 3 ARK Identifier Scheme University of California (UCOP) 4 Expires 31 January 2004 R. P. C. Rodgers 5 US National Library of Medicine 6 31 July 2003 8 The ARK Persistent Identifier Scheme 10 (http://www.ietf.org/internet-drafts/draft-kunze-ark-06.txt) 12 Status of this Document 14 This document is an Internet-Draft and is in full conformance with 15 all provisions of Section 10 of RFC2026. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as ``work in progress.'' 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 Distribution of this document is unlimited. Please send comments to 34 jak@ucop.edu. 36 Copyright (C) The Internet Society (2003). All Rights Reserved. 38 Abstract 40 The ARK (Archival Resource Key) is a scheme intended to facilitate 41 the persistent naming and retrieval of information objects. It 42 comprises an identifier syntax and three services. 
An ARK has four 43 components: 45 [http://NMAH/]ark:/NAAN/Name 47 an optional and mutable Name Mapping Authority Hostport part (NMAH, 48 where "hostport" is a hostname followed optionally by a colon and 49 port number), the "ark:" label, the Name Assigning Authority Number 50 (NAAN), and the assigned Name. The NAAN and Name together form the 51 immutable persistent identifier for the object. 53 An ARK request is an ARK with a service request and a question mark 54 appended to it. Use of an ARK request proceeds in two steps. First, 55 the NMAH, if not specified, is discovered based on the NAAN. Two 56 discovery methods are proposed: one is file based, the other based 57 on the DNS NAPTR record. Second, the ARK request is submitted to the 58 NMAH. Three ARK services are defined, gaining access to: (1) the 59 object (or a sensible substitute), (2) a description of the object 60 (metadata), and (3) a description of the commitment made by the NMA 61 regarding the persistence of the object (policy). These services are 62 defined initially to use the HTTP protocol. When the NMAH is speci- 63 fied, the ARK is a valid URL that can gain access to ARK services 64 using an unmodified Web client. 66 1. Introduction 68 This document describes a scheme for the high-quality naming of 69 information resources. The scheme, called the Archival Resource Key 70 (ARK), is well suited to long-term access and identification for any 71 information resources that accommodate reasonably regular electronic 72 description. This includes digital documents, databases, software, 73 and websites, as well as physical objects (such as books, bones, and 74 statues) and intangible objects (chemicals, diseases, vocabulary 75 terms, performances). Hereafter the term "object" refers to an 76 information resource. The term ARK itself refers both to the scheme 77 and to any single identifier that conforms to it. 79 Schemes for persistent identification of network-accessible objects 80 are not new.
In the early 1990's, the design of the Uniform Resource 81 Name [URNSYN] responded to the observed failure rate of URLs by 82 articulating an indirect, non-hostname-based naming scheme and the 83 need for responsible name management. Meanwhile, promoters of the 84 Digital Object Identifier [DOI] succeeded in building a community of 85 providers around a mature software system that supports name 86 management. The Persistent Uniform Resource Locator [PURL] was a 87 third scheme that has the unique advantage of working with unmodified 88 web browsers. The ARK scheme is a new approach. 90 A founding principle of the ARK is that persistence is purely a 91 matter of service. Persistence is neither inherent in an object nor 92 conferred on it by a particular naming syntax. Rather, persistence 93 is achieved through a provider's successful stewardship of objects 94 and their identifiers. The highest level of persistence will be 95 reinforced by a provider's robust contingency, redundancy, and 96 succession strategies. It is further safeguarded to the extent that 97 a provider's mission is shielded from marketplace and political 98 instabilities. 100 1.1. Three Reasons to Use ARKs 102 The first requirement of an ARK is to give users a link from an 103 object to a promise of stewardship for it. That promise is a multi- 104 faceted covenant that binds the word of an identified service 105 provider to a specific set of responsibilities. No one can tell if 106 successful stewardship will take place because no one can predict the 107 future. Reasonable conjecture, however, may be based on past 108 performance. There must be a way to tie a promise of persistence to 109 a provider's demonstrated or perceived ability -- its reputation -- 110 in that arena. Provider reputations would then rise and fall as 111 promises are observed variously to be kept and broken. This is 112 perhaps the best way we have for gauging the strength of any 113 persistence promise. 
115 The second requirement of an ARK is to give users a link from an 116 object to a description of it. The problem with a naked identifier 117 is that without a description real identification is incomplete. 118 Identifiers common today are relatively opaque, though some contain 119 ad hoc clues that reflect fleeting life cycle events such as the 120 address of a short stay in a filesystem hierarchy. Possession of 121 both an identifier and an object is some improvement, but positive 122 identification may still be elusive since the object itself need not 123 include a matching identifier or be transparent enough to reveal its 124 identity without significant research. In either case, what is 125 called for is a record bearing witness to the identifier's 126 association with the object, as supported by a recorded set of object 127 characteristics. This descriptive record is partly an identification 128 "receipt" with which users and archivists can verify an object's 129 identity after brief inspection and a plausible match with recorded 130 characteristics such as title and size. 132 The final requirement of an ARK is to give users a link to the object 133 itself (or to a copy) if at all possible. Persistent access is the 134 central duty of an ARK, with persistent identification playing a 135 vital but supporting role. Object access may not be feasible for 136 various reasons, such as catastrophic loss of the object, a licensing 137 agreement that keeps an archive "dark" for a period of years, or when 138 an object's own lack of tangible existence precludes normal concepts 139 of access (e.g., a vocabulary term might be accessed through its 140 definition). In such cases the ARK's identification role assumes a 141 much higher profile. But attempts to simplify the persistence 142 problem by decoupling access from identification and concentrating 143 exclusively on the latter are of questionable utility. 
A perfect 144 system for assigning forever unique identifiers might be created, but 145 if it did so without reducing access failure rates, no one would be 146 interested. The central issue -- which may be summed up as the "HTTP 147 404 Not Found" problem -- would not have been addressed. 149 1.2. Organizing Support for ARKs 151 Co-location of persistent access and identification services is 152 natural. Any organization that undertakes ongoing support of true 153 persistent identification (which includes description) is well-served 154 if it controls, owns, or otherwise has clear internal access to the 155 identified objects, and this gives it an advantage if it wishes also 156 to support persistent external access. Conversely, the latter 157 implies a commitment to collection management activities such as 158 monitoring, acquisition, verification, and change control over 159 objects that are persistently identified at least for the sake of 160 internal record keeping and accountability; this covers the major 161 prerequisite for external support of persistent identification. 162 Organizing ARK services under one roof thus tends to make sense. 164 ARK support is not for everybody. By requiring specific, revealed 165 commitments to preservation, object access, and description, the bar 166 for providing ARK services is high. On the other hand, it would be 167 hard to grant credence to a persistence promise from an organization 168 that could not muster the minimum ARK services. Not that there isn't 169 a business model for an ARK-like, description-only service built on 170 top of another organization's full complement of ARK services. For 171 example, there might be competition at the description level for 172 abstracting and indexing a body of scientific literature archived in 173 a combination of open and fee-based repositories. Such a business 174 would benefit more from persistence than it would directly support 175 it. 177 1.3. 
A Definition of Identifier 179 Heretofore, persistence discussion has been hampered by a borrowed 180 meaning for "identifier" that emerged as a side effect of defining 181 the Uniform Resource Identifier in [URI]: 183 (formerly) An identifier is a sequence of characters with a 184 restricted syntax ... that can act as a reference to something 185 that has identity. 187 The term works in context, but falters when employed for persistence. 188 Troubling phrases arise, such as, 190 "The goal is to create an identifier that does not break." 192 As defined this kind of identifier "breaks" when it sustains damage 193 to its character sequence, but really what breaks has to do with the 194 identifier's reference role. The following definition is proposed. 196 (new definition) An identifier is an association between a 197 string (a sequence of characters) and an information resource. 198 That association is made manifest by a record (e.g., a 199 cataloging or other metadata record) that binds the identifier 200 string to a set of identifying resource characteristics. 202 The identifier (the association) must be vouched for by some sort of 203 record. In the complete absence of any testimony (e.g., metadata) 204 regarding an association, a would-be identifier string is a 205 meaningless sequence of characters. To keep an externally visible 206 but otherwise internal identifier string opaque to outsiders, for 207 example, it suffices for an organization not to disclose the nature 208 of its association. For our immediate purpose, actual existence of 209 an association record is more important than its authenticity. If 210 one is lucky an object carries its own identifier as part of itself 211 (e.g., imprinted on the first page), but in processes such as 212 resource discovery and retrieval the typical object is often unwieldy 213 or unavailable (such as when licensing restrictions are in effect). 
214 A metadata record that includes the identifier string is the next 215 best thing -- a conveniently manipulable surrogate that can act as 216 both an association "receipt" and "declaration". 218 It now makes sense to speak of preventing an identifier, as an 219 association, from breaking. Having said that, this document still 220 (ab)uses the terms "ARK" and "identifier" as shorthands to refer to 221 identifier strings, in other words, to sequences of characters. Thus 222 a discussion of ARK syntax refers to a string format, not an 223 association format. The context should make the meaning clear. 225 2. ARK Anatomy 227 An ARK is represented by a sequence of characters (a string) that 228 contains the label, "ark:", optionally preceded by the beginning part 229 of a URL. Here is a diagrammed example. 231 http://foobar.zaf.org/ark:/12025/654xz321 232 \___________________/ \__/ \___/ \______/ 233 (replaceable) | | | 234 | ARK Label | Name (assigned by the NAA) 235 | | 236 Name Mapping Authority Name Assigning Authority 237 Hostport (NMAH) Number (NAAN) 239 The ARK syntax can be summarized, 241 [http://NMAH/]ark:/NAAN/Name 243 where the NMAH part is in brackets to indicate that it is temporary, 244 replaceable, and optional. 246 2.1. The Name Mapping Authority Hostport (NMAH) 248 Before the "ark:" label may appear an optional Name Mapping Authority 249 Hostport (NMAH) that is a temporary address where ARK service 250 requests may be sent. It consists of "http://" (or any service 251 specification valid for a URL) followed by an Internet hostname or 252 hostport combination having the same format and semantics as the 253 hostport part of a URL. The most important thing about the NMAH is 254 that it is "identity inert" from the point of view of object 255 identification. In other words, ARKs that differ only in the 256 optional NMAH part identify the same object. 
Thus, for example, the 257 following three ARKs are synonyms for but one information resource: 259 http://foobar.zaf.org/ark:/12025/654xz321 260 http://sneezy.dopey.com/ark:/12025/654xz321 261 ark:/12025/654xz321 263 The NMAH part makes an ARK into an actionable URL. Conversely, any 264 URL whose path component begins with "ark:/" stands a reasonable 265 chance of being an ARK (only because such URLs are not common), but 266 further verification is still required (such as probing the URL for 267 the three ARK services). 269 The NMAH part is temporary, disposable, and replaceable. Over time 270 the NMAH will likely stop working and have to be replaced with a cur- 271 rently active service provider. This relies on a mapping authority 272 discovery process, of which two alternate methods are outlined in a 273 later section. Meanwhile, a carefully chosen NMAH can be as durable 274 as any Internet domain name, and so may last for a decade or longer. 275 Users should be prepared, however, to refresh the NMAH because the 276 one found in the URL form of the ARK may have stopped working. 278 The above method for creating an actionable identifier from a basic 279 ARK (prepending "http://" and an NMAH) is itself temporary. Assuming 280 that the reign of [HTTP] in information retrieval will end one day, 281 ARKs will have to be converted into new kinds of actionable identi- 282 fiers. In any event, if ARKs see widespread use, web browsers would 283 presumably evolve to perform this (currently simple) transformation 284 automatically. 286 2.2. The Name Assigning Authority Number (NAAN) 288 The part of the ARK directly following the "ark:" is the Name 289 Assigning Authority Number (NAAN) enclosed in `/' (slash) characters. 290 This part is always required, as it identifies the organization that 291 originally assigned the Name of the object. It is used to discover a 292 currently valid NMAH and to provide top-level partitioning of the 293 space of all ARKs.
NAANs are registered in a manner similar to URN 294 Namespaces, but they are pure numbers consisting of 5 digits or 9 295 digits. Thus, the first 100,000 registered NAAs fit compactly into 296 the 5 digits, and if growth warrants, the next billion fit into the 9 297 digit form. In either case the fixed odd number of digits helps 298 reduce the chances of finding a NAAN out of context and confusing it 299 with nearby quantities such as 4-digit dates. 301 2.3. The Name Part 303 The final part of the ARK is the Name assigned by the NAA, and it is 304 also required. The Name is a string of visible ASCII characters and 305 should be less than 128 bytes in length. The length restriction 306 keeps the ARK short enough to append ordinary ARK request strings 307 without running into transport restrictions within HTTP GET requests. 308 Characters may be letters, digits, or any of these seven characters: 310 = @ $ _ * + # 312 The following characters may also be used, but in limited ways: 314 / . - % 316 The characters `/' and `.' are ignored if either appears as the last 317 character of an ARK. If used internally, they allow a name assigning 318 authority to reveal object hierarchy and object variants as described 319 in the next two sections. 321 A `-' (hyphen) may appear in an ARK, but must be ignored in lexical 322 comparisons. The `%' character is reserved for %-encoding all other 323 octets that would appear in the ARK string, in the same manner as for 324 URIs [URI]. A %-encoded octet consists of a `%' followed by two hex 325 digits; for example, "%7d" stands in for `}'. Lower case hex digits 326 are preferred to reduce the chances of false acronym recognition; 327 thus it is better to use "%acT" instead of "%ACT". The character `%' 328 itself must be represented using "%25". As with URNs, %-encoding 329 permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) 330 that have less restricted character repertoires [URNBIB].
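As an illustration only, the character rules given above for the Name part might be applied along the following lines. This is a minimal Python sketch and not part of the ARK specification: the names encode_name and is_valid_name are hypothetical, and converting the input to UTF-8 octets before %-encoding is an assumption (the text speaks only of octets).

```python
import re

# Characters that may appear unencoded in the Name part, per the text
# above: letters, digits, the punctuation "= @ $ _ * + #", and the
# limited-use characters "/ . -".
UNENCODED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "=@$_*+#/.-"
)

# A Name is a run of permitted characters and %-encoded octets.
NAME_RE = re.compile(r"(?:[A-Za-z0-9=@$_*+#/.\-]|%[0-9A-Fa-f]{2})+")


def encode_name(raw: str) -> str:
    """%-encode every octet that may not appear literally in a Name.

    Assumption: the raw string is first converted to UTF-8 octets.
    Lower case hex digits are emitted, as the text prefers, and '%'
    itself becomes "%25".
    """
    out = []
    for octet in raw.encode("utf-8"):
        ch = chr(octet)
        if ch in UNENCODED:
            out.append(ch)
        else:
            out.append("%{:02x}".format(octet))
    return "".join(out)


def is_valid_name(name: str) -> bool:
    """Check the character repertoire and the under-128 length guideline."""
    return len(name) < 128 and NAME_RE.fullmatch(name) is not None
```

For instance, encode_name("a}b") yields "a%7db", and is_valid_name("654%7") is false because the `%' is not followed by two hex digits.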
332 The creation of names that include linguistically based constructs 333 (having recognizable meaning from natural language) is strongly dis- 334 couraged if long-term persistence is a naming priority. Such names 335 do not age or travel well. Names that look more or less like numbers 336 avoid common problems that defeat persistence and international 337 acceptance. The use of digits is highly recommended. Mixing in non- 338 vowel alphabetic characters is a relatively safe and easy way to 339 achieve more compact names, although any character repertoire can 340 work if potentially troublesome names will be discarded during a 341 screening process. More on naming considerations is given in a later 342 section. 344 2.3.1. Names that Reveal Object Hierarchy 346 A name assigning authority may choose to reveal the presence of a 347 hierarchical relationship between objects using the `/' (slash) 348 character in the Name part of an ARK. If the Name contains an 349 internal slash, the piece to its left indicates a containing object. 350 For example, publishing an ARK of the form, 352 ark:/12025/654/xz/321 354 is equivalent to publishing three ARKs, 356 ark:/12025/654/xz/321 357 ark:/12025/654/xz 358 ark:/12025/654 360 together with a declaration that the first object is contained in the 361 second object, and that the second object is contained in the third. 363 Revealing the presence of hierarchy is completely up to the assigning 364 authority. It is hard enough to commit to one object's name, let 365 alone to three objects' names and to a specific, ongoing relatedness 366 among them.
Thus, regardless of whether hierarchy was present ini- 367 tially, the assigning authority, by not using slashes, reveals no 368 shared inferences about hierarchical or other inter-relatedness in 369 the following ARKs: 371 ark:/12025/654_xz_321 372 ark:/12025/654_xz 373 ark:/12025/654xz321 374 ark:/12025/654xz 375 ark:/12025/654 377 Note that slashes around the ARK's NAAN (/12025/ in these examples) 378 are not part of the ARK's Name and therefore do not indicate the 379 existence of some sort of NAAN super object containing all objects in 380 its namespace. A slash must have at least one non-structural charac- 381 ter (one that is neither a slash nor a period) on both sides in order 382 for it to separate recognizable structural components. So initial or 383 final slashes may be removed, and double slashes may be converted 384 into single slashes. 386 2.3.2. Names that Reveal Object Variants 388 A name assigning authority may choose to reveal the possible presence 389 of variant objects using the `.' (period) character in the Name part 390 of an ARK. If the Name contains an internal period, the piece to its 391 left is a base name and the piece to its right up to the end of the 392 ARK or to the next period is a suffix. A Name may have more than one 393 suffix, for example, 395 ark:/12025/654.24 396 ark:/12025/xz4/654.24 397 ark:/12025/654.f55.g78.v20 399 There are two main rules. First, if two ARKs share the same base 400 name but have different suffixes, the corresponding objects were con- 401 sidered variants of each other (different formats, languages, ver- 402 sions, etc.) by the assigning authority. Thus, the following ARKs 403 are variants of each other: 405 ark:/12025/654.f55.g78.v20 406 ark:/12025/654.321xz 407 ark:/12025/654.44 409 Second, publishing an ARK with a suffix implies the existence of at 410 least one variant identified by the ARK without its suffix. The ARK 411 otherwise permits no further assumptions about what variants might 412 exist.
So publishing the ARK, 414 ark:/12025/654.f55.g78.v20 416 is equivalent to publishing the four ARKs, 417 ark:/12025/654.f55.g78.v20 418 ark:/12025/654.f55.g78 419 ark:/12025/654.f55 420 ark:/12025/654 422 Revealing the possibility of variants is completely up to the assign- 423 ing authority. It is hard enough to commit to one object's name, let 424 alone to multiple variants' names and to a specific, ongoing related- 425 ness among them. The assigning authority is the sole arbiter of what 426 constitutes a variant within its namespace, and whether to reveal 427 that kind of relatedness by using periods within its names. 429 A period must have at least one non-structural character (one that is 430 neither a slash nor a period) on both sides in order for it to sepa- 431 rate recognizable structural components. So initial or final periods 432 may be removed, and double periods may be converted into single peri- 433 ods. Multiple suffixes should be arranged in sorted order (pure 434 ASCII collating sequence) at the end of an ARK. 436 2.3.3. Hyphens are Ignored 438 Hyphens are always ignored in ARKs. Hyphens may be added to an ARK's 439 Name part for readability, or during the formatting and wrapping of 440 text lines, but (as in phone numbers) they are treated as if they 441 were not present. Thus, like the NMAH, hyphens are "identity inert" 442 in comparing ARKs for equivalence. For example, the following ARKs 443 are equivalent for purposes of comparison and ARK service access: 445 ark:/12025/65-4-xz-321 446 http://sneezy.dopey.com/ark:/12025/654--xz32-1 447 ark:/12025/654xz321 449 2.4. Normalization and Lexical Equivalence 451 To determine if two or more ARKs identify the same object, the ARKs 452 are compared for lexical equivalence after first being normalized.
453 Since ARK strings may appear in various forms (e.g., having different 454 NMAHs), normalizing them minimizes the chances that comparing two ARK 455 strings for equality will fail unless they actually identify 456 different objects. In a specified-host ARK (one having an NMAH), the 457 NMAH never participates in such comparisons. 459 Normalization of an ARK for the purpose of octet-by-octet equality 460 comparison with another ARK consists of four steps. First, any upper 461 case letters in the "ark:" label and the two characters following a 462 `%' are converted to lower case. The case of all other letters in 463 the ARK string must be preserved. Second, any NMAH part is removed 464 (everything from an initial "http://" up to the next slash) and all 465 hyphens are removed. 467 Third, structural characters (slash and period) are normalized. 468 Initial and final occurrences are removed, and two structural 469 characters in a row (e.g., // or ./) are replaced by the first 470 character, iterating until each occurrence has at least one non- 471 structural character on either side. Finally, if there are any 472 components with a period on the left and a slash on the right, either 473 the component and the preceding period must be moved to the end of 474 the Name part or the ARK must be thrown out as malformed. 476 The fourth and final step is to arrange the suffixes in ASCII 477 collating sequence (that is, to sort them) and to remove duplicate 478 suffixes, if any. It is also permissible to throw out ARKs for which 479 the suffixes are not sorted. 481 The resulting ARK string is now normalized. Comparisons between 482 normalized ARKs are case-sensitive, meaning that upper case letters 483 are considered different from their lower case counterparts. 485 To keep ARK string variation to a minimum, no reserved ARK characters 486 should be %-encoded unless it is deliberately to conceal their 487 reserved meanings. 
No non-reserved ARK characters should ever be 488 %-encoded. Finally, no %-encoded character should ever appear in an 489 ARK in its decoded form. 491 2.5. Naming Considerations 493 The ARK has different goals from the URI, so it has different 494 character set requirements. Because linguistic constructs imperil 495 persistence, for ARKs non-ASCII character support is unimportant. 496 ARKs and URIs share goals of transcribability and transportability 497 within web documents, so characters are required to be visible, non- 498 conflicting with HTML/XML syntax, and not subject to tampering during 499 transmission across common transport gateways. Add the goal of 500 making an undelimited ARK recognizable in running prose, as in 501 ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma, 502 period) end up being excluded from the ARK lest the end of a phrase 503 or sentence be mistaken for part of the ARK. 505 A valuable technique for provision of persistent objects is to try to 506 arrange for the complete identifier to appear on, with, or near its 507 retrieved object. An object encountered at a moment in time when its 508 discovery context has long since disappeared could then easily be 509 traced back to its metadata, to alternate versions, to updates, etc. 510 This has seen reasonable success, for example, in book publishing and 511 software distribution. 513 If persistence is the goal, a deliberate local strategy for 514 systematic name assignment is crucial. Names must be chosen with 515 great care. Poorly chosen and managed names will devastate any 516 persistence strategy, and they do not discriminate based on naming 517 scheme. Whether a mistakenly re-assigned identifier is a URN, DOI, 518 PURL, URL, or ARK, the damage -- failed access and confusion -- is 519 not mitigated more in one scheme than in another. 
Conversely, in-house 520 efforts to manage names responsibly will go much further 521 towards safeguarding persistence than any choice of naming scheme or 522 name resolution technology. 524 Hostnames appearing in any identifier meant to be persistent must be 525 chosen with extra care. The tendency in hostname selection has 526 traditionally been to choose a token with recognizable attributes, 527 such as a corporate brand, but that tendency wreaks havoc with 528 persistence that is to outlive brands, corporations, subject 529 classifications, and natural language semantics (e.g., what did the 530 three letters "gay" mean in 1958, 1978, and 1998?). Today's recognized 531 and correct attributes are tomorrow's stale or incorrect attributes. 532 In making hostnames (any names, actually) long-term persistent, it 533 helps to eliminate recognizable attributes to the extent possible. 534 This affects selection of any name based on URLs, including PURLs and 535 the explicitly disposable NMAHs. There is no excuse for a provider 536 that manages its internal names impeccably not to exercise the same 537 care in choosing what could be an exceptionally durable hostname, 538 especially if it would form the prefix for all the provider's URL-based 539 external names. Registering an opaque hostname in the ".org" 540 or ".net" domain would not be a bad start. 542 Dubious persistence speculation does not make selecting naming 543 strategies any easier. For example, despite rumors to the contrary, 544 there are really no obvious reasons why the organizations registering 545 DNS names, URN Namespaces, and DOI publisher IDs should have among 546 them one that is intrinsically more fallible than the next. 547 Moreover, it is a misconception that the demise of DNS and of HTTP 548 need adversely affect the persistence of URLs.
At such a time, 549 certainly URLs from the present day might not then be actionable by 550 our present-day mechanisms, but resolution systems for future non- 551 actionable URLs are no harder to imagine than resolution systems for 552 present-day non-actionable URNs and DOIs. There is no more stable a 553 namespace than one that is dead and frozen, and that would then 554 characterize the space of names bearing the "http://" prefix. It is 555 useful to remember that just because hostnames have been carelessly 556 chosen in their brief history does not mean that they are unsuitable 557 in NMAHs (and URLs) intended for use in situations demanding the 558 highest level of persistence available in the Internet environment. 559 A well-planned name assignment strategy is everything. 561 3. Assigners of ARKs 563 A Name Assigning Authority (NAA) is an organization that creates (or 564 delegates creation of) long-term associations between identifiers and 565 information objects. Examples of NAAs include national libraries, 566 national archives, and publishers. An NAA may arrange with an 567 external organization for identifier assignment. The US Library of 568 Congress, for example, allows OCLC (the Online Computer Library 569 Center, a major world cataloger of books) to create associations 570 between Library of Congress call numbers (LCCNs) and the books that 571 OCLC processes. A cataloging record is generated that testifies to 572 each association, and the identifier is included by the publisher, 573 for example, in the front matter of a book. 575 An NAA does not so much create an identifier as create an 576 association. The NAA first draws an unused identifier string from 577 its namespace, which is the set of all identifiers under its control. 578 It then records the assignment of the identifier to an information 579 object having sundry witnessed characteristics, such as a particular 580 author and modification date. 
A namespace is usually reserved for an 581 NAA by agreement with recognized community organizations (such as 582 IANA and ISO) that all names containing a particular string be under 583 its control. In the ARK an NAA is represented by the Name Assigning 584 Authority Number (NAAN). 586 The ARK namespace reserved for an NAA is the set of names bearing its 587 particular NAAN. For example, all strings beginning with 588 "ark:/12025/" are under control of the NAA registered under 12025, 589 which might be the National Library of Finland. Because each NAA has 590 a different NAAN, names from one namespace cannot conflict with those 591 from another. Each NAA is free to assign names from its namespace 592 (or delegate assignment) according to its own policies. These 593 policies must be documented in a manner similar to the declarations 594 required for URN Namespace registration [URNNID]. 596 For now, registration of ARK NAAs is in a bootstrapping phase. To 597 register, please read about the mapping authority discovery file in 598 the next section and send email to jak@ucop.edu. 600 4. Finding a Name Mapping Authority 602 In order to derive an actionable identifier (these days, a URL) from 603 an ARK, a hostport (hostname or hostname plus port combination) for a 604 working Name Mapping Authority (NMA) must be found. An NMA is a 605 service that is able to respond to the three basic ARK service 606 requests. Relying on registration and client-side discovery, NMAs 607 make known which NAAs' identifiers they are willing to service. 609 Upon encountering an ARK, a user (or client software) looks inside it 610 for the optional NMAH part (the hostport of the NMA's ARK service). 611 If it contains an NMAH that is working, this NMAH discovery step may 612 be skipped; the NMAH effectively uses the beginning of an ARK to 613 cache the results of a prior mapping authority discovery process. 
If 614 a new NMAH needs to be found, the client looks inside the ARK again for 615 the NAAN (Name Assigning Authority Number). Querying a global 616 database, it then uses the NAAN to look up all current NMAHs that 617 service ARKs issued by the identified NAA. The global database is 618 key, and two specific methods for querying it are given in this 619 section. 621 In the interests of long-term persistence, however, ARK mechanisms 622 are first defined in high-level, protocol-independent terms so that 623 mechanisms may evolve and be replaced over time without compromising 624 fundamental service objectives. Either or both specific methods 625 given here may eventually be supplanted by better methods since, by 626 design, the ARK scheme does not depend on a particular method, but 627 only on having some method to locate an active NMAH. 629 At the time of issuance, at least one NMAH for an ARK should be 630 prepared to service it. That NMA may or may not be administered by 631 the Name Assigning Authority (NAA) that created it. Consider the 632 following hypothetical example of providing long-term access to a 633 cancer research journal. The publisher wishes to turn a profit and 634 the National Library of Medicine wishes to preserve the scholarly 635 record. An agreement might be struck whereby the publisher would act 636 as the NAA and the national library would archive the journal issue 637 when it appears, but without providing direct access for the first 638 six months. During the first six months of peak commercial 639 viability, the publisher would retain exclusive delivery rights and 640 would charge access fees. Again, by agreement, both the library and 641 the publisher would act as NMAs, but during that initial period the 642 library would redirect requests for issues less than six months old 643 to the publisher.
At the end of the waiting period, the library 644 would then begin servicing requests for issues older than six months 645 by tapping directly into its own archives. Meanwhile, the publisher 646 might routinely redirect incoming requests for older issues to the 647 library. Long-term access is thereby preserved, and so is the 648 commercial incentive to publish content. 650 There is never a requirement that an NAA also run an NMA service, 651 although it seems not an unlikely scenario. Over time NAAs and NMAs 652 would come and go. One NMA would succeed another, and there might be 653 many NMAs serving the same ARKs simultaneously (e.g., as mirrors or 654 as competitors). There might also be asymmetric but coordinated NMAs 655 as in the library-publisher example above. 657 4.1. Looking Up NMAHs in a Globally Accessible File 659 This subsection describes a way to look up NMAHs using a simple text 660 file. For efficient access the file may be stored in a local 661 filesystem, but it needs to be reloaded periodically to incorporate 662 updates. It is not expected that the size of the file or frequency 663 of update should impose an undue maintenance or searching burden any 664 time soon, for even primitive linear search of a file with ten- 665 thousand NAAs is a subsecond operation on modern server machines. 666 The proposed file strategy is similar to the /etc/hosts file strategy 667 that supported Internet host address lookup for a period of years 668 before the advent of the Domain Name System [DNS]. 670 A copy of the current file (at the time of writing) appears in an 671 appendix and is available on the web. A minimal version of the file 672 appears below. Comment lines (lines that begin with `#') explain the 673 format and give the file's modification time, reloading address, and 674 NAA registration instructions. There is even a Perl script that 675 processes the file embedded in the file's comments. 
Because this is 676 still a proposed file, none of the values in it are real. 678 # 679 # Name Assigning Authority / Name Mapping Authority Lookup Table 680 # Last change: 31 July 2003 681 # Reload from: http://ark.nlm.nih.gov/etc/natab 682 # Mirrored at: http://ark.cdlib.org/natab 683 # To register: mailto:jak@ucop.edu?Subject=naareg 684 # Process with: Perl script at end of this file (optional) 685 # 686 # Each NAA appears at the beginning of a line with the NAA Number 687 # first, a colon, and an ARK or URL to a statement of naming policy 688 # (see http://ark.cdlib.org for an example). 689 # All the NMA hostports that service an NAA are listed, one per 690 # line, indented, after the corresponding NAA line. 691 # 692 # National Library of Medicine 693 12025: http://www.nlm.nih.gov/xxx/naapolicy.html 694 ark.nlm.nih.gov USNLM 695 foobar.zaf.org UCSF 696 sneezy.dopey.com BIREME 697 # 698 # Library of Congress 699 12026: http://www.loc.gov/xxx/naapolicy.html 700 foobar.zaf.org USLC 701 sneezy.dopey.com USLC 702 # 703 # National Agriculture Library 704 12027: http://www.nal.gov/xxx/naapolicy.html 705 foobar.zaf.gov:80 USNAL 706 # 707 # University of California 708 13030: http://ark.cdlib.org/ 709 ark.cdlib.org CDL 710 # 711 # World Intellectual Property Organization 712 13038: http://www.wipo.int/xxx/naapolicy.html 713 www.wipo.int WIPO 714 # 715 #--- end of data --- 716 # The following Perl script takes an NAA as argument and outputs 717 # the NMAs in this file listed under any matching NAA. 718 # 719 # my $naa = shift; 720 # while (<>) { 721 # next if (! /^$naa:/); 722 # while (<>) { 723 # last if (! /^[#\s]./); 724 # print "$1\n" if (/^\s+(\S+)/); 725 # } 726 # } 727 # 728 # Create a g/t/nroff-safe version of this table with the UNIX command, 729 # 730 # expand natab | sed 's/\\/\\\e/g' > natab.roff 731 # 732 # end of file 734 4.2. 
Looking up NMAHs Distributed via DNS 736 This subsection introduces a method for looking up NMAHs that is 737 based on the method for discovering URN resolvers described in 738 [NAPTR]. It relies on querying the DNS system already installed in 739 the background infrastructure of most networked computers. A query 740 is submitted to DNS asking for a list of resolvers that match a given 741 NAAN. DNS distributes the query to the particular DNS servers that 742 can best provide the answer, unless the answer can be found more 743 quickly in a local DNS cache as a side-effect of a recent query. 744 Responses come back inside Name Authority Pointer (NAPTR) records. 745 The normal result is one or more candidate NMAHs. 747 In its full generality the [NAPTR] algorithm ambitiously accommodates 748 a complex set of preferences, orderings, protocols, mapping services, 749 regular expression rewriting rules, and DNS record types. This 750 subsection proposes a drastic simplification of it for the special 751 case of ARK mapping authority discovery. The simplified algorithm is 752 called Maptr. It uses only one DNS record type (NAPTR) and restricts 753 most of its field values to constants. The following hypothetical 754 excerpt from a DNS data file for the NAAN known as 12026 shows three 755 example NAPTR records ready to use with the Maptr algorithm. 757 12026.ark.arpa. 758 ;; US Library of Congress 759 ;; order pref flags service regexp replacement 760 IN NAPTR 0 0 "h" "ark" "USLC" lhc.nlm.nih.gov:8080 761 IN NAPTR 0 0 "h" "ark" "USLC" foobar.zaf.org 762 IN NAPTR 0 0 "h" "ark" "USLC" sneezy.dopey.com 764 All the fields are held constant for Maptr except for the "flags", 765 "regexp", and "replacement" fields. The "service" field contains the 766 constant value "ark" so that NAPTR records participating in the Maptr 767 algorithm will not be confused with other NAPTR records. 
The "order" 768 and "pref" fields are held to 0 (zero) and otherwise ignored for now; 769 the algorithm may evolve to use these fields for ranking decisions 770 when usage patterns and local administrative needs are better 771 understood. 773 When a Maptr query returns a record with a flags field of "h" (for 774 hostport, a Maptr extension to the NAPTR flags), the replacement 775 field contains the NMAH (hostport) of an ARK service provider. When 776 a query returns a record with a flags field of "" (the empty string), 777 the client needs to submit a new query containing the domain name 778 found in the replacement field. This second sort of record exploits 779 the distributed nature of DNS by redirecting the query to another 780 domain name. It looks like this. 782 12345.ark.arpa. 783 ;; Digital Library Consortium 784 ;; order pref flags service regexp replacement 785 IN NAPTR 0 0 "" "ark" "" dlc.spct.org. 787 Here is the Maptr algorithm for ARK mapping authority discovery. In 788 it replace <NAAN> with the NAAN from the ARK for which an NMAH is 789 sought. 791 (1) Initialize the DNS query: type=NAPTR, 792 query=<NAAN>.ark.arpa. 794 (2) Submit the query to DNS and retrieve (NAPTR) records, discarding 795 any record that does not have "ark" for the service 796 field. 798 (3) All remaining records with a flags field of "h" contain 799 candidate NMAHs in their replacement fields. Set them aside, if 800 any. 802 (4) Any record with an empty flags field ("") has a replacement 803 field containing a new domain name to which a subsequent query 804 should be redirected. For each such record, set query=<replacement> 805 then go to step (2). When all such records have been 806 recursively exhausted, go to step (5). 808 (5) All redirected queries have been resolved and a set of candidate 809 NMAHs has been accumulated from step (3). If there are 810 zero NMAHs, exit -- no mapping authority was found. If there are 811 one or more NMAHs, choose one using any criteria you wish, then 812 exit.
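The five steps above can also be sketched independently of any DNS library. In the Python sketch below, the DNS query of step (2) is represented by a caller-supplied function, so the names Naptr, maptr, lookup, and ZONE are all hypothetical, as is the "h" record shown for dlc.spct.org. The guard against revisiting a domain name is an addition not found in the steps above; it is there only so that a misconfigured pair of redirects cannot loop forever.

```python
from typing import Callable, Iterable, List, NamedTuple, Set


class Naptr(NamedTuple):
    """One NAPTR record, reduced to the fields that Maptr inspects."""
    flags: str
    service: str
    replacement: str


def maptr(naan: str, lookup: Callable[[str], Iterable[Naptr]]) -> List[str]:
    """Return the candidate NMAHs for a NAAN; an empty list means none found."""
    candidates: List[str] = []
    seen: Set[str] = set()
    pending = [naan + ".ark.arpa."]            # step (1): initial query name
    while pending:
        name = pending.pop(0)
        if name in seen:                       # loop guard (not in the steps)
            continue
        seen.add(name)
        for rr in lookup(name):                # step (2): submit the query
            if rr.service != "ark":            # discard non-"ark" records
                continue
            if rr.flags == "h":                # step (3): candidate NMAH
                candidates.append(rr.replacement)
            elif rr.flags == "":               # step (4): redirected query
                pending.append(rr.replacement)
    return candidates                          # step (5): caller chooses one


# Hypothetical zone data standing in for DNS.
ZONE = {
    "12345.ark.arpa.": [Naptr("", "ark", "dlc.spct.org.")],
    "dlc.spct.org.": [Naptr("h", "ark", "foobar.zaf.org")],
}
print(maptr("12345", lambda name: ZONE.get(name, [])))  # ['foobar.zaf.org']
```

Passing the query function in as an argument lets the redirect logic be exercised without network access; a real client would supply a function that issues NAPTR queries.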
814 A Perl script that implements this algorithm is included here. 816 #!/depot/bin/perl 818 use Net::DNS; # include simple DNS package 819 my $qtype = "NAPTR"; # initialize query type 820 my $naa = shift; # get NAAN script argument 821 my $mad = new Net::DNS::Resolver; # mapping authority discovery 823 &maptr("$naa.ark.arpa"); # call maptr - that's it 825 sub maptr { # recursive maptr algorithm 826 my $dname = shift; # domain name as argument 827 my ($rr, $order, $pref, $flags, $service, $regexp, 828 $replacement); 829 my $query = $mad->query($dname, $qtype); 830 return # non-productive query 831 if (! $query || ! $query->answer); 832 foreach $rr ($query->answer) { 833 next # skip records of wrong type 834 if ($rr->type ne $qtype); 835 ($order, $pref, $flags, $service, $regexp, 836 $replacement) = split(/\s/, $rr->rdatastr); 837 if ($flags eq "") { 838 &maptr($replacement); # recurse 839 } elsif ($flags eq "h") { 840 print "$replacement\n"; # candidate NMAH 841 } 842 } 843 } 845 The global database thus distributed via DNS and the Maptr algorithm 846 can easily be seen to mirror the contents of the Name Authority Table 847 file described in the previous section. 849 5. Generic ARK Service Definition 851 An ARK request's output is delivered information; examples include 852 the object itself, a policy declaration (e.g., a promise of support), 853 a descriptive metadata record, or an error message. ARK services 854 must be couched in high-level, protocol-independent terms if 855 persistence is to outlive today's networking infrastructural 856 assumptions. The high-level ARK service definitions listed below are 857 followed in the next section by a concrete method (one of many 858 possible methods) for delivering these services with today's 859 technology. 861 5.1. Generic ARK Access Service (access, location) 863 Returns (a copy of) the object or a redirect to the same, although a 864 sensible object proxy may be substituted. 
Examples of sensible 865 substitutes include, 866 - a table of contents instead of a large complex document, 867 - a home page instead of an entire web site hierarchy, 868 - a rights clearance challenge before accessing protected data, 869 - directions for access to an offline object (e.g., a book), 870 - a description of an intangible object (a disease, an event), or 871 - an applet acting as "player" for a large multimedia object. 873 May also return a discriminated list of alternate object locators. 874 If access is denied, returns an explanation of the object's current 875 (perhaps permanent) inaccessibility. 877 5.2. Generic Policy Service (permanence, naming, etc.) 879 Returns declarations of policy and support commitments for given 880 ARKs. Declarations are returned in either a structured metadata 881 format or a human readable text format; sometimes one format may 882 serve both purposes. Policy subareas may be addressed in separate 883 requests, but the following areas should be covered: object 884 permanence, object naming, object fragment addressing, and 885 operational service support. 887 The permanence declaration for an object is a rating defined with 888 respect to an identified permanence provider (guarantor), and may 889 include the following aspects. One permanence rating framework is 890 given in [NLMPerm]. 892 (a) "object availability" -- whether and how access to the 893 object is supported (e.g., online 24x7, or offline only), 895 (b) "identifier validity" -- under what conditions the 896 identifier will be or has been re-assigned, 898 (c) "content invariance" -- under what conditions the content of 899 the object is subject to change, and 901 (d) "change history" -- documentation, whether abbreviated or 902 detailed, of any or all corrections, migrations, revisions, etc. 904 Naming policy for an object includes an historical description of the 905 NAA's (and its successor NAA's) policies regarding differentiation of 906 objects.
It may include the following aspects. 908 (e) "similarity" -- (or "unity") the limit, defined by the NAA, 909 to the level of dissimilarity beyond which two similar objects 910 warrant separate identifiers but before which they share one 911 single identifier, and 913 (f) "granularity" -- the limit, defined by the NAA, to the level 914 of object subdivision beyond which sub-objects do not warrant 915 separately assigned identifiers but before which sub-objects are 916 assigned separate identifiers. 918 Addressing policy for an object includes a description of how, during 919 access, object components (e.g., paragraphs, sections) or views 920 (e.g., image conversions) may or may not be "addressed", in other 921 words, how the NMA permits arguments or parameters to modify the 922 object delivered as the result of an ARK request. If supported, 923 these sorts of operations would provide things like byte-ranged 924 fragment delivery and open-ended format conversions, or any set of 925 possible transformations that would be too numerous to list or to 926 identify with separately assigned ARKs. 928 Operational service support policy includes a description of general 929 operational aspects of the NMA service, such as after-hours staffing 930 and trouble reporting procedures. 932 5.3. Generic Description Service 934 Returns a description of the object. Descriptions are returned in 935 either a structured metadata format or a human readable text format; 936 sometimes one format may serve both purposes. A description must at 937 a minimum answer the who, what, when, and where questions concerning 938 an expression of the object. Standalone descriptions should be 939 accompanied by the modification date and source of the description 940 itself. May also return discriminated lists of ARKs that are related 941 to the given ARK. 943 6. 
Overview of the HTTP Key Mapping Protocol (HKMP) 945 The HTTP Key Mapping Protocol (HKMP) is a way of taking a key (a kind 946 of identifier) and asking such questions as, what information does 947 this identify and how permanent is it? [HKMP] is in fact one 948 specific method under development for delivering ARK services. The 949 protocol runs over HTTP to exploit the web browser's current pre- 950 eminence as user interface to the Internet. HKMP is designed so that 951 a person can enter ARK requests directly into the location field of 952 current browser interfaces. Because it runs over HTTP, HKMP can be 953 simulated and tested within keyboard-based [TELNET] sessions. 955 The asker (a person or client program) starts with an identifier, 956 such as an ARK or a URL. The identifier reveals to the asker (or 957 allows the asker to infer) the Internet host name and port number of 958 a server system that responds to questions. Here, this is just the 959 NMAH that is obtained by inspection and possibly lookup based on the 960 ARK's NAAN. The asker then sets up an HTTP session with the server 961 system, sends a question via an HKMP request (contained within an 962 HTTP request), receives an answer via an HKMP response (contained 963 within an HTTP response), and closes the session. That concludes the 964 connected portion of the protocol. 966 An HKMP request is a string of characters beginning with a `?' 967 (question mark) that is appended to the identifier string. The 968 resulting string is sent as an argument to HTTP's GET command. 970 Request strings too long for GET may be sent using HTTP's POST 971 command. The three most common requests correspond to three 972 degenerate special cases that keep the user's learning and typing 973 burden low. First, a simple key with no request at all is the same 974 as an ordinary access request. 
Thus a plain ARK entered into a 975 browser's location field behaves much like a plain URL, and returns 976 access to the primary identified object, for instance, an HTML 977 document. 979 The second special case is a minimal ARK description request string 980 consisting of just "?". For example, entering the string, 982 ark.nlm.nih.gov/12025/psbbantu? 984 into the browser's location field directly precipitates a request for 985 a metadata record describing the object identified by 986 ark:/12025/psbbantu. The browser, unaware of HKMP, prepares and sends an HTTP GET 987 request in the same manner as for a URL. HKMP is designed so that 988 the response (indicated by the returned HTTP content type) is normally 989 displayed, whether the output is structured for machine processing 990 (text/plain) or formatted for human consumption (text/html). 992 In the following example HKMP session, each line has been annotated 993 to include a line number and whether it was the client or server that 994 sent it. Without going into much depth, the session has three pieces 995 separated from each other by blank lines: the client's piece (lines 996 1-3), the server's HTTP/HKMP response headers (4-7), and the body of 997 the server's response (8-17). The first and last lines (1 and 17) 998 correspond to the client's steps to start the TCP session and the 999 server's steps to end it, respectively. 1001 1 C: [opens session] 1002 C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1 1003 C: 1004 S: HTTP/1.1 200 OK 1005 5 S: Content-Type: text/plain 1006 S: HKMP-Status: 0.1 200 OK 1007 S: 1008 S: |set: NLM | 12025/psbbantu? | 2003 07 31 1009 S: http://ark.nlm.nih.gov/ark:/12025/psbbantu?
1010 10 S: here: 1 | 1 | 1 1011 S: 1012 S: erc: 1013 S: who: Lederberg, Joshua 1014 S: what: Studies of Human Families for Genetic Linkage 1015 15 S: when: 1974 1016 S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf 1017 S: [closes session] 1019 The first two server response lines (4-5) above are typical of HTTP. 1020 The next line (6) is peculiar to HKMP, and indicates the HKMP version 1021 and a normal return status. The balance of the response consists of 1022 a record set header (lines 8-10) and a single metadata record (12-16) 1023 that comprises the ARK description service response. The record set 1024 header identifies (8-9) who created the set, what its title is, when 1025 it was created, and where an automated process can access the set, 1026 and ends in a line (10) indicating that here in this communication 1027 the recipient can expect to find records numbered 1 to 1 of a total 1028 of 1 record in the set (i.e., here is the entire set, consisting of 1029 exactly one record). 1031 The returned record (12-16) is in the format of an Electronic 1032 Resource Citation [ERC], which is discussed in more detail in the 1033 next section. For now, note that it contains four elements that 1034 answer the top priority questions regarding an expression of the 1035 object: who played a major role in expressing it, what the expression 1036 was called, when it was created, and where the expression may be 1037 found. This quartet of elements comes up again and again in ERCs. 1039 The third degenerate special case of an ARK request (and no other 1040 cases will be described in this document) is the string "??", corresponding 1041 to a minimal permanence policy request. It can be seen in 1042 use appended to an ARK (on line 2) in the example session that 1043 follows. 1045 1 C: [opens session] 1046 C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu??
HTTP/1.1 1047 C: 1048 S: HTTP/1.1 200 OK 1049 5 S: Content-Type: text/plain 1050 S: HKMP-Status: 0.1 200 OK 1051 S: 1052 S: |set: NLM | 12025/psbbantu?? | 2003 07 31 1053 S: http://ark.nlm.nih.gov/ark:/12025/psbbantu?? 1054 10 S: here: 1 | 1 | 1 1055 S: 1056 S: erc: 1057 S: who: Lederberg, Joshua 1058 S: what: Studies of Human Families for Genetic Linkage 1059 15 S: when: 1974 1060 S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf 1061 S: erc-support: 1062 S: who: USNLM 1063 S: what: Permanent, Unchanging Content 1064 20 S: when: 2001 04 21 1065 S: where: http://ark.nlm.nih.gov/yy22948 1066 S: [closes session] 1068 Again, a single metadata record (lines 12-21) is returned, but it 1069 consists of two segments. The first segment (12-16) gives the same 1070 basic citation information as in the previous example. It is 1071 returned in order to establish context for the persistence 1072 declaration in the second segment (17-21). 1074 Each segment in an ERC tells a different story relating to the 1075 object, so although the same four questions (elements) appear in 1076 each, the answers depend on the segment's story type. While the 1077 first segment tells the story of an expression of the object, the 1078 second segment tells the story of the support commitment made to it: 1079 who made the commitment, what the nature of the commitment was, when 1080 it was made, and where a fuller explanation of the commitment may be 1081 found. 1083 7. Overview of Electronic Resource Citations (ERCs) 1085 An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a 1086 simple, compact, and printable record designed to hold data 1087 associated with an information resource. By design, the ERC is a 1088 metadata format that balances the needs for expressive power, very 1089 simple machine processing, and direct human manipulation. 
1091 A founding principle of the ERC is that direct human contact with 1092 metadata will be a necessary and sufficient condition for the near-term 1093 rapid development of metadata standards, systems, and services. 1094 Thus the machine-processable ERC format must only minimally strain 1095 people's ability to read, understand, change, and transmit ERCs 1096 without their relying on intermediation with specialized software 1097 tools. The basic ERC needs to be succinct, transparent, and 1098 trivially parseable by software. 1100 In the current Internet, it is natural to give serious consideration to 1101 XML as an exchange format because of predictions that it will obviate 1102 many ad hoc formats and programs, and unify much of the world's 1103 information under one reliable data structuring discipline that is 1104 easy to generate, verify, parse, and render. It appears, however, 1105 that XML is still only catching on after years of standards work and 1106 implementation experience. The reasons for this are unclear, but for 1107 now very simple XML interpretation is still out of reach. Another 1108 important caution is that XML structures are hard on the eyeballs, 1109 taking up an amount of display (and page) space that significantly 1110 exceeds that of traditional formats. Until these conflicts with ERC 1111 principles are resolved, XML is not a first choice for representing 1112 ERCs. Borrowing instead from the data structuring format that 1113 underlies the successful spread of email and web services, the first 1114 ERC format is based on email and HTTP headers (RFC822) [EMHDRS]. 1115 There is a naturalness to its label-colon-value format (seen in the 1116 previous section) that barely needs explanation to a person beginning 1117 to enter ERC metadata. 1119 Besides simplicity of ERC system implementation and data entry 1120 mechanics, ERC semantics (what the record and its constituent parts 1121 mean) must also be easy to explain.
   ERC semantics are based on a reformulation and extension of the
   Dublin Core [DCORE] hypothesis, which suggests that the fifteen
   Dublin Core metadata elements have a key role to play in cross-domain
   resource description. The ERC design recognizes that the Dublin
   Core's primary contribution is the international, interdisciplinary
   consensus that identified fifteen semantic buckets (element
   categories), regardless of how they are labeled. The ERC then adds a
   definition for a record and some minimal compliance rules. In
   pursuing the limits of simplicity, the ERC design combines and
   relabels some Dublin Core buckets to isolate a tiny kernel (subset)
   of four elements for basic cross-domain resource description.

   For the cross-domain kernel, the ERC uses the four basic elements --
   who, what, when, and where -- to pretend that every object in the
   universe can have a uniform minimal description. Each has a name or
   other identifier, a location, some responsible person or party, and a
   date. It doesn't matter what type of object it is, or whether one
   plans to read it, interact with it, smoke it, wear it, or navigate
   it. Of course, this approach is flawed because uniformity of
   description for some object types requires more semantic contortion
   and sacrifice than for others. That is why at the beginning of this
   document, the ARK was said to be suited to objects that accommodate
   reasonably regular electronic description.

   While insisting on uniformity at the most basic level provides
   powerful cross-domain leverage, the semantic sacrifice is great for
   many applications. So the ERC also permits a semantically rich and
   nuanced description to co-exist in a record along with a basic
   description. In that way both sophisticated and naive recipients of
   the record can extract the level of meaning from it that best suits
   their needs and abilities.
   Key to unlocking the richer description is a controlled vocabulary of
   ERC record types (not explained in this document) that permit
   knowledgeable recipients to apply defined sets of additional
   assumptions to the record.

7.1. ERC Syntax

   An ERC record is a sequence of metadata elements ending in a blank
   line. An element consists of a label, a colon, and an optional
   value. Here is an example of a record with five elements.

      erc:
      who:   Gibbon, Edward
      what:  The Decline and Fall of the Roman Empire
      when:  1781
      where: http://www.ccel.org/g/gibbon/decline/

   A long value may be folded (continued) onto the next line by
   inserting a newline and indenting the next line. A value can be thus
   folded across multiple lines. Here are two example elements, each
   folded across four lines.

      who/created: University of California, San Francisco, AIDS
         Program at San Francisco General Hospital | University
         of California, San Francisco, Center for AIDS Prevention
         Studies
      what/Topic:
         Heart Attack | Heart Failure
         | Heart
         Diseases

   An element value folded across several lines is treated as if the
   lines were joined together on one long line. For example, the second
   element from the previous example is considered equivalent to

      what/Topic: Heart Attack | Heart Failure | Heart Diseases

   An element value may contain multiple values, each one separated from
   the next by a `|' (pipe) character. The element from the previous
   example contains three values.

   For annotation purposes, any line beginning with a `#' (hash)
   character is treated as if it were not present; this is a "comment"
   line (a feature not available in email or HTTP headers).
   For example, the following element is spread across four lines and
   contains two values:

      what/Topic:
         Heart Attack
      #  | Heart Failure -- hold off until next review cycle
         | Heart Diseases

7.2. ERC Stories

   An ERC record is organized into one or more distinct segments, where
   each segment tells a story about a different aspect of the
   information resource. A segment boundary occurs whenever a segment
   label (an element beginning with "erc") is encountered. The basic
   label "erc:" introduces the story of an object's expression (e.g.,
   its publication, installation, or performance). The label
   "erc-about:" introduces the story of an object's content (what it is
   about) and "erc-support:" introduces the story of a support
   commitment made to it. A story segment that concerns the ERC itself
   is introduced by the label "erc-from:". It is an important segment
   that tells the story of the ERC's provenance. Elements beginning
   with "erc" are reserved for segment labels and their associated story
   types. From an earlier example, here is an ERC with two segments.

      erc:
      who:   Lederberg, Joshua
      what:  Studies of Human Families for Genetic Linkage
      when:  1974
      where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
      erc-support:
      who:   NIH/NLM/LHNCBC
      what:  Permanent, Unchanging Content
      # Note to ops staff: date needs verification.
      when:  2001 04 21
      where: http://ark.nlm.nih.gov/yy22948

   Segment stories are told according to journalistic tradition. While
   any number of pertinent elements may appear in a segment, priority is
   placed on answering the questions who, what, when, and where at the
   beginning of each segment so that readers can make the most important
   selection or rejection decisions as soon as possible.
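
   The syntax rules of Section 7.1 (label-colon-value elements, folded
   lines, `#' comment lines, `|'-separated values, a blank line ending
   the record) and the segment rule of this section can be sketched in a
   few lines of Python. This is only an illustrative reading of the
   prose, not a reference parser: the function name parse_erc and the
   returned dictionary layout are assumptions made for the example, and
   markers, %-encodings, and expansion blocks are deliberately not
   handled.

```python
def parse_erc(text):
    """Parse one ERC record into a list of segments (hypothetical layout).

    Each segment is {"label": ..., "values": [...], "elements": [...]},
    where "elements" is a list of (label, [values]) pairs.
    """
    logical = []                      # one entry per unfolded element
    for line in text.splitlines():
        if line.startswith("#"):      # comment line: treated as absent
            continue
        if not line.strip():          # blank line ends the record
            if logical:
                break
            continue
        if line[0] in " \t":          # indented line continues the previous value
            if logical:
                logical[-1] += " " + line.strip()
            continue
        logical.append(line.strip())

    segments = []
    for entry in logical:
        label, _, value = entry.partition(":")   # split at the first colon only
        label = label.strip()
        values = [v.strip() for v in value.split("|") if v.strip()]
        if label.startswith("erc"):   # segment label starts a new story
            segments.append({"label": label, "values": values, "elements": []})
        else:
            if not segments:          # elements seen before any segment label
                segments.append({"label": None, "values": [], "elements": []})
            segments[-1]["elements"].append((label, values))
    return segments
```

   Applied to the two-segment record above, this sketch yields one
   segment labeled "erc" and one labeled "erc-support", each holding the
   four elements who, what, when, and where in order; the comment line
   is skipped.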
   To make things simple, the listed ordering of the questions is
   maintained in each segment (as it happens most people who have been
   exposed to this story telling technique are already familiar with the
   above ordering).

   The four questions are answered by using corresponding element
   labels. The four element labels can be re-used in each story
   segment, but their meaning changes depending on the segment (the
   story type) in which they appear. In the example above, "who" is
   first used to name a document's author and subsequently used to name
   the permanence guarantor (provider). Similarly, "when" first lists
   the date of object creation and in the next segment lists the date of
   a commitment decision. Four labels appearing across three segments
   effectively map to twelve semantically distinct elements. Distinct
   element meanings are mapped to Dublin Core elements in a later
   section.

7.3. The ERC Anchoring Story

   Each ERC contains an anchoring story. It is usually the first
   segment labeled "erc:" and it concerns an "anchoring" expression of
   the object. An "anchoring" expression is the one that a provider
   deemed the most suitable basic referent given the audience and
   application for which it produced the ERC. If it sounds like the
   provider has great latitude in choosing its anchoring expression, it
   is because it does. A typical anchoring story in an ERC for a
   born-digital document would be the story of the document's release on
   a web site; such a document would then be the anchoring expression.

   An anchoring story need not be the central descriptive goal of an ERC
   record.
   For example, a museum provider may create an ERC for a digitized
   photograph of a painting but choose to anchor it in the story of the
   original painting instead of the story of the electronic likeness;
   although the ERC may through other segments prove to be centrally
   concerned with describing the electronic likeness, the provider may
   have chosen this particular anchoring story in order to make the ERC
   visible in a way that is most natural to patrons (who would find the
   Mona Lisa under da Vinci sooner than they would find it under the
   name of the person who snapped the photograph or scanned the image).
   In another example, a provider that creates an ERC for a dramatic
   play as an abstract work has the task of describing a piece of
   intangible intellectual property. To anchor this abstract object in
   the concrete world, if only through a derivative expression, it makes
   sense for the provider to choose a suitable printed edition of the
   play as the anchoring object expression (to describe in the anchoring
   story) of the ERC.

   The anchoring story has special rules designed to keep ERC processing
   simple and predictable. Each of the four basic elements (who, what,
   when, and where) must be present, unless a best effort to supply it
   fails. In the event of failure, the element still appears but a
   special value (described later) is used to explain the missing value.
   While the requirement that each of the four elements be present only
   applies to the anchoring story segment, as usual these elements
   appear at the beginning of the segment and may only be used in the
   prescribed order. A minimal ERC would normally consist of just an
   anchoring story and the element quartet, as illustrated in the next
   example.

      erc:
      who:   National Research Council
      what:  The Digital Dilemma
      when:  2000
      where: http://books.nap.edu/html/digital%5Fdilemma

   A minimal ERC can be abbreviated so that it resembles a traditional
   compact bibliographic citation that is nonetheless completely machine
   processable. The required elements and ordering make it possible to
   eliminate the element labels, as shown here.

      erc: National Research Council | The Digital Dilemma | 2000
           | http://books.nap.edu/html/digital%5Fdilemma

7.4. ERC Elements

   As mentioned, the four basic ERC elements (who, what, when, and
   where) take on different specific meanings depending on the story
   segment in which they are used. By appearing in each segment, albeit
   in different guises, the four elements serve as a valuable mnemonic
   device -- a kind of checklist -- for constructing minimal story
   segments from scratch. Again, it is only in the anchoring segment
   that all four elements are mandatory.

   Here are some mappings between ERC elements and Dublin Core [DCORE]
   elements.

      Segment    ERC Element  Equivalent Dublin Core Element
      ---------  -----------  ------------------------------
      erc        who          Creator/Contributor/Publisher
      erc        what         Title
      erc        when         Date
      erc        where        Identifier
      erc-about  who
      erc-about  what         Subject
      erc-about  when         Coverage (temporal)
      erc-about  where        Coverage (spatial)

   The basic element labels may also be qualified to add nuances to the
   semantic categories that they identify. Elements are qualified by
   appending a `/' (slash) and a qualifier term. Often qualifier terms
   appear as the past tense form of a verb because it makes re-using
   qualifiers among elements easier.

      who/published: ...
      when/published: ...
      where/published: ...
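
   As a rough illustration of the two ideas above, the following Python
   sketch splits a qualified label at the slash and looks up the Dublin
   Core equivalent for a given segment. The table contents are copied
   from the mapping above; the names DUBLIN_CORE, split_label, and
   dublin_core_for are hypothetical and chosen only for this example.

```python
# Mapping copied from the table above; None marks the cell that the
# table leaves empty (erc-about / who).
DUBLIN_CORE = {
    ("erc", "who"): "Creator/Contributor/Publisher",
    ("erc", "what"): "Title",
    ("erc", "when"): "Date",
    ("erc", "where"): "Identifier",
    ("erc-about", "who"): None,
    ("erc-about", "what"): "Subject",
    ("erc-about", "when"): "Coverage (temporal)",
    ("erc-about", "where"): "Coverage (spatial)",
}


def split_label(label):
    """Split 'who/published' into ('who', 'published'); qualifier may be None."""
    base, _, qualifier = label.partition("/")
    return base, (qualifier or None)


def dublin_core_for(segment, label):
    """Return the Dublin Core equivalent of a basic element in a segment.

    The qualifier is ignored for the lookup, since the table is keyed
    on the unqualified label. Returns None when no equivalent is listed.
    """
    base, _ = split_label(label)
    return DUBLIN_CORE.get((segment, base))
```

   For instance, dublin_core_for("erc", "when/published") looks up the
   unqualified "when" in the "erc" segment, while the same label in an
   "erc-about" segment resolves to the temporal Coverage entry.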

   Using past tense verbs for qualifiers also reminds providers and
   recipients that element values contain transient assertions that may
   have been true once, but that tend to become less true over time.
   Recipients that don't understand the meaning of a qualifier can fall
   back onto the semantic category (bucket) designated by the
   unqualified element label. Inevitably recipients (people and
   software) will have diverse abilities in understanding elements and
   qualifiers.

   Any number of other elements and qualifiers may be used in
   conjunction with the quartet of basic segment questions. The only
   semantic requirement is that they pertain to the segment's story.
   Also, it is only the four basic elements that change meaning
   depending on their segment context. All other elements have meaning
   independent of the segment in which they appear. If an element label
   stripped of its qualifier is still not recognized by the recipient, a
   second fall back position is to ignore it and rely on the four basic
   elements.

   Elements may be either Canonical, Provisional, or Local. Canonical
   elements are officially recognized via a registry as part of the
   metadata vernacular. All elements, qualifiers, and segment labels
   used in this document up until now belong to that vernacular.
   Provisional elements are also officially recognized via the registry,
   but have only been proposed for inclusion in the vernacular. To be
   promoted to the vernacular, a provisional element passes through a
   vetting process during which its documentation must be in order and
   its community acceptance demonstrated. Local elements are any
   elements not officially recognized in the registry. The registry
   [REG] is a work in progress.
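
   The two fall back positions described in this section (drop an
   unrecognized qualifier, then ignore an unrecognized element) amount
   to a short decision chain. A minimal Python sketch follows, assuming
   the recipient keeps a set of the labels it understands; the function
   name resolve_label and the contents of that set are hypothetical.

```python
def resolve_label(label, understood):
    """Return the form of `label` a recipient should act on, or None.

    understood: set of labels (qualified or not) the recipient knows.
    """
    if label in understood:           # qualified form is understood as-is
        return label
    base = label.split("/", 1)[0]     # first fall back: strip the qualifier
    if base in understood:
        return base
    return None                       # second fall back: ignore the element
```

   A recipient that knows only the four basic elements would thus treat
   "when/published" as plain "when" and skip any element whose
   unqualified label it does not recognize.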

   Local elements are immediately distinguishable from Canonical or
   Provisional elements because all terms that begin with an upper case
   letter are reserved for spontaneous local use. No term beginning
   with an upper case letter will ever be assigned Canonical or
   Provisional status, so it should be safe to use such terms for local
   purposes. Any recipient of external ERCs containing such terms will
   understand them to be part of the originating provider's local
   metadata dialect. Here's an example ERC with three segments, one
   local element, and two local qualifiers. The segment boundaries have
   been emphasized by comment lines (which, as before, are ignored by
   processors).

      erc:
      who:    Bullock, TH | Achimowicz, JZ | Duckrow, RB
              | Spencer, SS | Iragui-Madoz, VJ
      what:   Bicoherence of intracranial EEG in sleep,
              wakefulness and seizures
      when:   1997 12 00
      where:  http://cogprints.soton.ac.uk/%{
              documents/disk0/00/00/01/22/index.html %}
      in:     EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
      IDcode: cog00000122
      # ---- new segment ----
      erc-about:
      what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
              | Cooperativity | Subdural | Hippocampus | Higher moment
      # ---- new segment ----
      erc-from:
      who:    NIH/NLM/NCBI
      what:   pm9546494
      when/Reviewed: 1998 04 18 021600
      where:  http://ark.nlm.nih.gov/12025/pm9546494?

   The local element "IDcode" immediately precedes the "erc-about"
   segment, which itself contains an element with the local qualifier
   "Subcategory". The second to last element also carries the local
   qualifier "Reviewed". Finally, what might be a provisional element
   "in" appears near the end of the first segment.
   It might have been proposed as a way to complete a citation for an
   object originally appearing inside another object (such as an article
   appearing in a journal or an encyclopedia).

7.5. ERC Element Values

   ERC element values tend to be straightforward strings. If the
   provider intends something special for an element, it will so
   indicate with markers at the beginning of its value string. The
   markers are designed to be uncommon enough that they would not likely
   occur in normal data except by deliberate intent. Markers can only
   occur near the beginning of a string, and once any octet of
   non-marker data has been encountered, no further marker processing is
   done for the element value. In the absence of markers the string is
   considered pure data; this has been the case with all the examples
   seen thus far. The fullest form of an element value with all three
   optional markers in place looks like this.

      VALUE = [markup_flags] (:ccode) , DATA

   In processing, the first non-whitespace character of an ERC element
   value is examined. An initial `[' is reserved to introduce a
   bracketed set of markup flags (not described in this document) that
   ends with `]'. If ERC data is machine-generated, each value string
   may be preceded by "[]" to prevent any of its data from being
   mistaken for markup flags. Once past the optional markup, the
   remaining value may optionally begin with a controlled code. A
   controlled code always has the form "(:ccode)", for example,

      who:  (:unkn) Anonymous
      what: (:791) Bee Stings

   Any string after such a code is taken to be an uncontrolled (e.g.,
   natural language) equivalent. The code "unkn" indicates a
   conventional explanation for a missing value (stating that the value
   is unknown).
   The remainder of the string makes an equivalent statement in a form
   that the provider deemed most suitable to its (probably human)
   audience. The code "791" could be a fixed numeric topic identifier
   within an unspecified topic vocabulary. Any code may be ignored by
   those that do not understand it.

   There are several codes to explain different ways in which a required
   element's value may go missing.

      (:unkn)   unknown (e.g., Anonymous, Inconnue)
      (:unav)   value unavailable indefinitely
      (:unac)   temporarily inaccessible
      (:unap)   not applicable, makes no sense
      (:unas)   value unassigned (e.g., Untitled)
      (:none)   never had a value, never will
      (:null)   explicitly empty
      (:unal)   unallowed, suppressed intentionally

   Once past an optional controlled code, the remaining string value is
   subjected to one final test. If the next non-whitespace character is
   a `,' (comma), it indicates that the string value is "sort-friendly".
   This means that the value is (a) laid out with an inverted word order
   useful for sorting items having comparably laid out element values
   (items might be the containing ERC records) and (b) that the value
   may contain other commas that indicate inversion points should it
   become necessary to recover the value in natural word order.
   Typically, this feature is used to express Western-style personal
   names in family-name-given-name order. It can also be used wherever
   natural word order might make sorting tricky, such as when data
   contains titles or corporate names. Here are some example elements.

      who:   , van Gogh, Vincent
      who:,Howell, III, PhD, 1922-1987, Thurston
      who:, Acme Rocket Factory, Inc., The
      who:, Mao Tse Tung
      who:, McCartney, Paul, Sir,
      what:, Health and Human Services, United States Government
         Department of, The,

   There are rules to use in recovering a copy of the value in natural
   word order, if desired. The above example strings have the following
   natural word order values, respectively.

      Vincent van Gogh
      Thurston Howell, III, PhD, 1922-1987
      The Acme Rocket Factory, Inc.
      Mao Tse Tung
      Sir Paul McCartney
      The United States Government Department of Health and Human Services

7.6. ERC Element Encoding and Dates

   Some characters that need to appear in ERC element values might
   conflict with special characters used for structuring ERCs, so there
   needs to be a way to include them as literal characters that are
   protected from special interpretation. This is accomplished through
   an encoding mechanism that resembles the %-encoding familiar to [URI]
   handlers.

   The ERC encoding mechanism also uses `%', but instead of taking two
   following hexadecimal digits, it takes one non-alphanumeric character
   or two alphabetic characters that cannot be mistaken for hex digits.
   It is designed not to be confused with normal web-style %-encoding.
   In particular it can be decoded without risking unintended decoding
   of normal %-encoded data (which would introduce errors). Here are
   the one-character (non-alphanumeric) ERC encoding extensions.

      ERC  Purpose
      ---  ------------------------------------------------
      %!   decodes to the element separator `|'
      %%   decodes to a percent sign `%'
      %.
           decodes to a comma `,'
      %_   a non-character used as syntax shim
      %{   a non-character that begins an expansion block
      %}   a non-character that ends an expansion block

   One particularly useful construct in ERC element values is the pair
   of special encoding markers ("%{" and "%}") that indicates an
   "expansion" block. Whatever string of characters they enclose will
   be treated as if none of the contained whitespace (SPACEs, TABs,
   Newlines) were present. This comes in handy for writing long,
   multi-part URLs in a readable way. For example, the value in

      where: http://foo.bar.org/node%{
                ? db = foo
                & start = 1
                & end = 5
                & buf = 2
                & query = foo + bar + zaf
             %}

   is decoded into an equivalent element, but with a correct and intact
   URL:

      where:
        http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

   In a parting word about ERC element values, a commonly recurring
   value type is a date, possibly followed by a time. ERC dates take on
   one of the following forms:

      1999                (four digit year)
      2000 12 29          (year, month, day)
      2000 12 29 235955   (year, month, day, hour, minute, second)

      21  Spring    31  1st quarter    25  Spring (so. hemisphere)
      22  Summer    32  2nd quarter    26  Summer (so. hemisphere)
      23  Fall      33  3rd quarter    27  Fall (so. hemisphere)
      24  Winter    34  4th quarter    28  Winter (so. hemisphere)

   In dates, all internal whitespace is squeezed out to achieve a
   normalized form suitable for lexical comparison and sorting. This
   means that the following dates

      2000 12 29 235955     (recommended for readability)
      2000 12 29 23 59 55
      20001229 23 59 55
      20001229235955        (normalized date and time)

   are all equivalent. The first form is recommended for readability.
   The last form (shortest and easiest to compute with) is the
   normalized form.
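
   Both whitespace rules in this section (inside an expansion block and
   inside a date) reduce to removing whitespace from a well-defined
   span. The Python sketch below illustrates them under simplifying
   assumptions: the function names are hypothetical, the other
   %-encodings in the table above are not decoded, and date ranges and
   lists are not considered.

```python
import re


def expand_blocks(value):
    """Remove all whitespace found between "%{" and "%}" markers.

    The markers themselves are non-characters and are dropped.
    """
    return re.sub(
        r"%\{(.*?)%\}",
        lambda m: re.sub(r"\s+", "", m.group(1)),
        value,
        flags=re.DOTALL,              # blocks may span several lines
    )


def normalize_date(value):
    """Squeeze out all whitespace to get the normalized date form."""
    return re.sub(r"\s+", "", value)
```

   With these definitions, the four equivalent date forms listed above
   all normalize to 20001229235955, and the folded "where" value from
   the expansion block example decodes to the single intact URL shown
   after it.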
   Hyphens and commas are reserved to create date ranges and lists, for
   example,

      1996-2000               (a range of five years)
      1952, 1957, 1969        (a list of three years)
      1952, 1958-1967, 1985   (a mixed list of dates and ranges)
      20001229-20001231       (a range of three days)

7.7. ERC Stub Records and Internal Support

   The ERC design introduces the concept of a "stub" record, which is an
   incomplete ERC record intended to be supplemented with additional
   elements before being released as a standalone ERC record. A stub
   ERC record has no minimum required elements. It is just a group of
   elements that does not begin with "erc:" but otherwise conforms to
   the ERC record syntax.

   ERC stubs may be useful in supporting internal procedures using the
   ERC syntax. Often they rely on the convenience and accuracy of
   automatically supplied elements, even the basic ones. To be ready
   for external use, however, an ERC stub must be transformed into a
   complete ERC record having the usual required elements. An ERC stub
   record can be convenient for metadata embedded in a document, where
   elements such as location, modification date, and size -- which one
   would not omit from an externalized record -- are omitted simply
   because they are much better supplied by a computation. A separate
   local administrative procedure, not defined for ERCs in general,
   would effect the promotion of stubs into complete records.

   While the ERC is a general-purpose container for exchange of resource
   descriptions, it does not dictate how records must be internally
   stored, laid out, or assembled by data providers or recipients.
   Arbitrary internal descriptive frameworks can support ERCs simply by
   mapping (e.g., on demand) local records to the ERC container format
   and making them available for export.
   Therefore, to support ERCs there is no need for a data provider to
   convert internal data to be stored in an ERC format. On the other
   hand, any provider (such as one just getting started in the business
   of resource description) may choose to store and manipulate local
   data natively in the ERC format.

8. Advice to Web Clients

   This section offers some advice to web client software developers.
   It is hard to write about because it tries to anticipate a series of
   events that might lead to native web browser support for ARKs.

   ARKs are envisaged to appear wherever durable object references are
   planned. Library cataloging records, literature citations, and
   bibliographies are important examples. In many of these places URLs
   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
   PURLs have been proposed as alternatives.

   The strings representing ARKs are also envisaged to appear in some of
   the places where URLs currently appear: in hypertext links (where
   they are not normally shown to users) and in rendered text (displayed
   or printed). Internet search engines, for example, tend to include
   both actionable and manifest links when listing each item found. A
   normal HTML link for which the URL is not displayed looks like this.

      Click Here

   The same link with an ARK instead of a URL:

      Click Here

   Web browsers would in general require a small modification to
   recognize and convert this ARK, via mapping authority discovery, to
   the URL form.

      Click Here

   A browser that knows how to make that conversion could also
   automatically detect and replace a non-working NMAH.

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

9.
Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information. These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery. For
   example, recipients of ARKs with a specified hostport (NMAH) should
   treat it like a URL and be aware that the identified ARK service may
   no longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As specializations of such protocols, an ARK service may limit
   exposure to the usual risks. Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

10. Authors' Addresses

   John A. Kunze
   California Digital Library
   University of California, Office of the President
   415 20th St, 4th Floor
   Oakland, CA 94612-3550, USA

   Fax:   +1 510-893-5212
   EMail: jak@ucop.edu

   R. P. C. Rodgers
   US National Library of Medicine
   8600 Rockville Pike, Bldg. 38A
   Bethesda, MD 20894, USA

   Fax:   +1 301-496-0673
   EMail: rodgers@nlm.nih.gov

11. References

   [DCORE]   Dublin Core Metadata Initiative, "Dublin Core Metadata
             Element Set, Version 1.1: Reference Description", July
             1999, http://dublincore.org/documents/dces/.

   [DNS]     P.V. Mockapetris, "Domain Names - Concepts and
             Facilities", RFC 1034, November 1987.

   [DOI]     International DOI Foundation, "The Digital Object
             Identifier (DOI) System", February 2001,
             http://dx.doi.org/10.1000/203.

   [EMHDRS]  D.
             Crocker, "Standard for the format of ARPA Internet text
             messages", RFC 822, August 1982.

   [ERC]     J. Kunze, "Electronic Resource Citations", work in
             progress.

   [HKMP]    J. Kunze, "HTTP Key Mapping Protocol", work in progress.

   [HTTP]    R. Fielding, et al, "Hypertext Transfer Protocol --
             HTTP/1.1", RFC 2616, June 1999.

   [MD5]     R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
             April 1992.

   [NAPTR]   M. Mealling, R. Daniel, "The Naming Authority Pointer
             (NAPTR) DNS Resource Record", RFC 2915, September 2000.

   [NLMPerm] M. Byrnes, "Defining NLM's Commitment to the Permanence of
             Electronic Information", ARL 212:8-9, October 2000,
             http://www.arl.org/newsltr/212/nlm.html

   [PURL]    K. Shafer, et al, "Introduction to Persistent Uniform
             Resource Locators", 1996,
             http://purl.oclc.org/OCLC/PURL/INET96

   [REG]     J. Kunze, "Resource Metadata Vocabulary", work in
             progress.

   [URI]     T. Berners-Lee, et al, "Uniform Resource Identifiers
             (URI): Generic Syntax", RFC 2396, August 1998.

   [URNBIB]  C. Lynch, et al, "Using Existing Bibliographic Identifiers
             as Uniform Resource Names", RFC 2288, February 1998.

   [URNSYN]  R. Moats, "URN Syntax", RFC 2141, May 1997.

   [URNNID]  L. Daigle, et al, "URN Namespace Definition Mechanisms",
             RFC 2611, June 1999.

   [TELNET]  J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
             RFC 854, May 1983.

12. Appendix: An NLM Prototype ARK Service

   The US National Library of Medicine (NLM) has an experimental,
   prototype ARK service under development. It is being made available
   for purposes of demonstrating various aspects of the ARK system, but
   is subject to temporary or permanent withdrawal (without notice)
   depending upon the circumstances of the small research group
   responsible for making it available.
   It is described at:

      http://ark.nlm.nih.gov/

   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

13. Appendix: Current ARK Name Authority Table

   This appendix contains a copy of the Name Authority Table (a file) at
   the time of writing. It may be loaded into a local filesystem (e.g.,
   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
   NMAHs (Name Mapping Authority Hostports). It contains Perl code that
   can be copied into a standalone script that processes the table (as a
   file). Because this is still a proposed file, none of the values in
   it are real.

   #
   # Name Assigning Authority / Name Mapping Authority Lookup Table
   #     Last change:  31 July 2003
   #     Reload from:  http://ark.nlm.nih.gov/etc/natab
   #     Mirrored at:  http://ark.cdlib.org/natab
   #     To register:  mailto:jak@ucop.edu?Subject=naareg
   #     Process with: Perl script at end of this file (optional)
   #
   # Each NAA appears at the beginning of a line with the NAA Number
   # first, a colon, and an ARK or URL to a statement of naming policy
   # (see http://ark.cdlib.org for an example).
   # All the NMA hostports that service an NAA are listed, one per
   # line, indented, after the corresponding NAA line.
   #
   # National Library of Medicine
   12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
           ark.nlm.nih.gov     USNLM
           foobar.zaf.org      UCSF
           sneezy.dopey.com    BIREME
   #
   # Library of Congress
   12026:  http://www.loc.gov/xxx/naapolicy.html
           foobar.zaf.org      USLC
           sneezy.dopey.com    USLC
   #
   # National Agriculture Library
   12027:  http://www.nal.gov/xxx/naapolicy.html
           foobar.zaf.gov:80   USNAL
   #
   # University of California
   13030:  http://ark.cdlib.org/
           ark.cdlib.org       CDL
   #
   # World Intellectual Property Organization
   13038:  http://www.wipo.int/xxx/naapolicy.html
           www.wipo.int        WIPO
   #
   #--- end of data ---
   # The following Perl script takes an NAA as argument and outputs
   # the NMAs in this file listed under any matching NAA.
   #
   # my $naa = shift;
   # while (<>) {
   #     next if (! /^$naa:/);
   #     while (<>) {
   #         last if (! /^[#\s]./);
   #         print "$1\n" if (/^\s+(\S+)/);
   #     }
   # }
   #
   # Create a g/t/nroff-safe version of this table with the UNIX command,
   #
   # expand natab | sed 's/\\/\\\e/g' > natab.roff
   #
   # end of file

14. Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.
However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.

Expires 31 January 2004

Table of Contents

Status of this Document
Abstract
1. Introduction
1.1. Three Reasons to Use ARKs
1.2. Organizing Support for ARKs
1.3. A Definition of Identifier
2. ARK Anatomy
2.1. The Name Mapping Authority Hostport (NMAH)
2.2. The Name Assigning Authority Number (NAAN)
2.3. The Name Part
2.3.1. Names that Reveal Object Hierarchy
2.3.2. Names that Reveal Object Variants
2.3.3. Hyphens are Ignored
2.4. Normalization and Lexical Equivalence
2.5. Naming Considerations
3. Assigners of ARKs
4. Finding a Name Mapping Authority
4.1. Looking Up NMAHs in a Globally Accessible File
4.2. Looking up NMAHs Distributed via DNS
5. Generic ARK Service Definition
5.1. Generic ARK Access Service (access, location)
5.2. Generic Policy Service (permanence, naming, etc.)
5.3. Generic Description Service
6. Overview of the HTTP Key Mapping Protocol (HKMP)
7. Overview of Electronic Resource Citations (ERCs)
7.1. ERC Syntax
7.2. ERC Stories
7.3. The ERC Anchoring Story
7.4. ERC Elements
7.5. ERC Element Values
7.6. ERC Element Encoding and Dates
7.7. ERC Stub Records and Internal Support
8. Advice to Web Clients
9. Security Considerations
10. Authors' Addresses
11. References
12. Appendix: An NLM Prototype ARK Service
13. Appendix: Current ARK Name Authority Table
14. Copyright Notice