idnits 2.17.1 

draft-kunze-ark-14.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 2370.

  ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure
     Invitation. 


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 51
     longer pages, the longest (page 2) being 63 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 51 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 8 instances of too long lines in the document, the longest one
     being 21 characters in excess of 72.

  ** The abstract seems to contain references ([Qualifier]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  == There are 8 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  == Line 1138 has weird spacing: '... regexp  repla...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (24 July 2007) is 6119 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'Qualifier' is mentioned on line 437, but not defined

  == Unused Reference: 'MD5' is defined on line 2143, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ANVL'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ARK'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Handle'

  ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Kernel'

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     'MD5')

  -- Possible downref: Non-RFC (?) normative reference: ref. 'N2T'

  ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC
     3401, RFC 3402, RFC 3403, RFC 3404)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NOID'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL'

  ** Obsolete normative reference: RFC  822 (Obsoleted by RFC 2822)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'TEMPER'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'THUMP'

  ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC
     3986)

  ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref.
     'URNBIB')

  ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC
     8141)

  ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC
     3406)


     Summary: 17 errors (**), 0 flaws (~~), 8 warnings (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet-Draft: draft-kunze-ark-14.txt                          J. Kunze
3	ARK Identifier Scheme                    University of California (UCOP)
4	Expires 24 January 2008                                 R. P. C. Rodgers
5	                                         US National Library of Medicine
6	                                                            24 July 2007

8	                  The ARK Persistent Identifier Scheme

10	      (http://www.ietf.org/internet-drafts/draft-kunze-ark-14.txt)

12	Status of this Document

14	   By submitting this Internet-Draft, each author represents that any
15	   applicable patent or other IPR claims of which he or she is aware
16	   have been or will be disclosed, and any of which he or she becomes
17	   aware will be disclosed, in accordance with Section 6 of BCP 79.

19	   Internet-Drafts are working documents of the Internet Engineering
20	   Task Force (IETF), its areas, and its working groups.  Note that
21	   other groups may also distribute working documents as Internet-
22	   Drafts.

24	   Internet-Drafts are draft documents valid for a maximum of six months
25	   and may be updated, replaced, or obsoleted by other documents at any
26	   time.  It is inappropriate to use Internet-Drafts as reference
27	   material or to cite them other than as "work in progress."

29	   The list of current Internet-Drafts can be accessed at
30	   http://www.ietf.org/1id-abstracts.html

32	   The list of Internet-Draft Shadow Directories can be accessed at
33	   http://www.ietf.org/shadow.html

35	   Distribution of this document is unlimited.  Please send comments to
36	   jak@ucop.edu

38	   Copyright (C) The IETF Trust (2007).  All Rights Reserved.

40	Abstract

42	   The ARK (Archival Resource Key) naming scheme is designed to
43	   facilitate the high-quality and persistent identification of
44	   information objects. A founding principle of the ARK is that
45	   persistence is purely a matter of service and is neither inherent in
46	   an object nor conferred on it by a particular naming syntax. The best
47	   that an identifier can do is to lead users to the services that
48	   support persistence. The term ARK itself refers both to the scheme
49	   and to any single identifier that conforms to it.  An ARK has five
50	   components:

52	              [http://NMAH/]ark:/NAAN/Name[Qualifier]

54	   an optional and mutable Name Mapping Authority Hostport, the "ark:"
55	   label, the Name Assigning Authority Number (NAAN), the assigned Name,
56	   and an optional and possibly mutable Qualifier supported by the NMA.
57	   The NAAN and Name together form the immutable persistent identifier
58	   for the object.  An ARK is a special kind of URL that connects users
59	   to three things: the named object, its metadata, and the provider's
60	   promise about its persistence. When entered into the location field
61	   of a Web browser, the ARK leads the user to the named object. That
62	   same ARK, followed by a single question mark ('?'), returns a brief
63	   metadata record that is both human- and machine-readable. When the
64	   ARK is followed by dual question marks ('??'), the returned metadata
65	   contains a commitment statement from the current provider.  Tools
66	   exist for minting, binding, and resolving ARKs.

68	1.  Introduction

70	   This document describes a scheme for the high-quality naming of
71	   information resources.  The scheme, called the Archival Resource Key
72	   (ARK), is well suited to long-term access and identification of any
73	   information resources that accommodate reasonably regular electronic
74	   description.  This includes digital documents, databases, software,
75	   and websites, as well as physical objects (books, bones, statues,
76	   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
77	   performances).  Hereafter the term "object" refers to an information
78	   resource.  The term ARK itself refers both to the scheme and to any
79	   single identifier that conforms to it.  A reasonably concise and
80	   accessible overview and rationale for the scheme is available at
81	   [ARK].

83	   Schemes for persistent identification of network-accessible objects
84	   are not new.  In the early 1990s, the design of the Uniform Resource
85	   Name [URNSYN] responded to the observed failure rate of URLs by
86	   articulating an indirect, non-hostname-based naming scheme and the
87	   need for responsible name management.  Meanwhile, promoters of the
88	   Digital Object Identifier [DOI] succeeded in building a community of
89	   providers around a mature software system [Handle] that supports name
90	   management.  The Persistent Uniform Resource Locator [PURL] is
91	   another scheme that has the unique advantage of working with
92	   unmodified web browsers.  ARKs represent an approach that attempts to
93	   build on the strengths and to avoid the weaknesses of the other
94	   schemes.

96	   A founding principle of the ARK is that persistence is purely a
97	   matter of service.  Persistence is neither inherent in an object nor
98	   conferred on it by a particular naming syntax.  Nor is the technique
99	   of name indirection - upon which URNs, Handles, DOIs, and PURLs are
100	   founded - of central importance.  Name indirection is an ancient and
101	   well-understood practice; new mechanisms for it keep appearing and
102	   distracting practitioner attention, with the Domain Name System [DNS]
103	   being a particularly dazzling and elegant example.  What is often
104	   forgotten is that maintenance of an indirection table is the
105	   overwhelming and unavoidable cost to the organization providing
106	   persistence, and the cost is equivalent across naming schemes.  That
107	   indirection has always been a native part of the web while being so
108	   lightly utilized for the persistence of web-based objects is an
109	   indication of how unsuited most organizations are to the task of
110	   table maintenance and to the overall challenge of digital permanence.

112	   Persistence is achieved through a provider's successful stewardship
113	   of objects and their identifiers.  The highest level of persistence
114	   will be reinforced by a provider's robust contingency, redundancy,
115	   and succession strategies.  It is further safeguarded to the extent
116	   that a provider's mission is shielded from marketplace and political
117	   instabilities.  These are by far the major challenges confronting
118	   persistence providers, and no identifier scheme has any direct impact
119	   on them.  In fact, some schemes may be actual liabilities for
120	   persistence because they create short- and long-term dependencies for
121	   every object access on complex, special-purpose local and global
122	   infrastructures, parts of which are proprietary and all of which
123	   increase the carry-forward burden for the preservation community.  It
124	   is for this reason that the ARK scheme relies only on educated name
125	   assignment and light use of general-purpose infrastructures that the
126	   entire internet community needs (the DNS, web servers, and web
127	   browsers) and that one can reasonably expect many others to help
128	   carry forward into the technologically evolving future.

130	1.1.  Reasons to Use ARKs

132	   If no persistent identifier scheme contributes directly to
133	   persistence, why not just use URLs?  A particular URL may be as
134	   durable an identifier as it is possible to have, but nothing
135	   distinguishes it from an ordinary URL to the recipient who is
136	   wondering if it is suitable for long-term reference.  An ARK is just
137	   a URL, distinguished by its form, that provides some of the necessary
138	   conditions for credible persistence.  An ARK invites access to not
139	   one, but to three things:  to the object, to its metadata, and to a
140	   nuanced statement of commitment from the provider regarding the
141	   object.  Existence of the two extra services can be probed
142	   automatically by appending either `?' or `??' to the ARK.
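
The probing convention described above can be sketched in a few lines of
Python.  This is only an illustration: the function name ark_service_urls
and the host example.org are assumptions made for the example, not part of
the ARK specification, and no network request is made.

```python
def ark_service_urls(ark_url: str) -> dict:
    """Build the three request strings for one ARK.

    The bare ARK asks for the object, a single trailing '?' asks for
    the metadata record, and '??' asks for the commitment statement.
    """
    # Tolerate input that already ends in '?' or '??' (an assumption
    # made for convenience in this sketch).
    base = ark_url.rstrip("?")
    return {
        "object": base,
        "metadata": base + "?",
        "commitment": base + "??",
    }


# example.org is a placeholder host; 12025/654xz321 is the NAAN/Name
# pair used in this document's own examples.
urls = ark_service_urls("http://example.org/ark:/12025/654xz321")
for service, url in urls.items():
    print(service, url)
```

A client could then issue an ordinary HTTP GET for each string and treat
a missing metadata or commitment response as a sign that the URL is not
a fully supported ARK.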

144	   The form of the ARK also supports the natural separation of naming
145	   authorities into the original name assigning authority and the
146	   diverse multiple name mapping (or servicing) authorities that in
147	   succession and in parallel will take over custodial responsibilities
148	   from the original assigner for the large majority of a long-term
149	   object's archival lifetime.  The mapping authority, indicated by the
150	   hostname part of the URL that contains the ARK, serves to launch the
151	   ARK into cyberspace.  Should it ever fail (and there is no reason why
152	   a well-chosen hostname of a 100-year-old cultural memory institution
153	   shouldn't last as long as the DNS), that host name is considered
154	   disposable and replaceable.  Again, the form of the ARK helps
155	   because it defines exactly how to recover the core immutable object
156	   identity, and several simple algorithms (based on the URN model) are
157	   defined for locating another mapping authority.

159	   There are tools to assist in generating ARKs and other identifiers,
160	   such as [NOID] and "uuidgen", both of which rely for uniqueness on
161	   human-maintained registries.  This document also contains some
162	   guidelines and considerations for managing namespaces and choosing
163	   hostnames wisely.

165	1.2.  Three Requirements of ARKs

167	   The first requirement of an ARK is to give users a link from an
168	   object to a promise of stewardship for it.  That promise is a multi-
169	   faceted covenant that binds the word of an identified service
170	   provider to a specific set of responsibilities.  No one can tell if
171	   successful stewardship will take place because no one can predict the
172	   future.  Reasonable conjecture, however, may be based on past
173	   performance.  There must be a way to tie a promise of persistence to
174	   a provider's demonstrated or perceived ability - its reputation - in
175	   that arena.  Provider reputations would then rise and fall as
176	   promises are observed variously to be kept and broken.  This is
177	   perhaps the best way we have for gauging the strength of any
178	   persistence promise.  Note that over time, current providers have
179	   nothing to do with the intentions of the original assigners of names.

181	   The second requirement of an ARK is to give users a link from an
182	   object to a description of it.  The problem with a naked identifier
183	   is that without a description real identification is incomplete.
184	   Identifiers common today are relatively opaque, though some contain
185	   ad hoc clues that reflect brief life cycle periods such as the
186	   address of a short stay in a filesystem hierarchy.  Possession of
187	   both an identifier and an object is some improvement, but positive
188	   identification may still be uncertain since the object itself might
189	   not include a matching identifier or might not carry evidence obvious
190	   enough to reveal its identity without significant research.  In
191	   either case, what is called for is a record bearing witness to the
192	   identifier's association with the object, as supported by a recorded
193	   set of object characteristics.  This descriptive record is partly an
194	   identification "receipt" with which users and archivists can verify
195	   an object's identity after brief inspection and a plausible match
196	   with recorded characteristics such as title and size.

198	   The final requirement of an ARK is to give users a link to the object
199	   itself (or to a copy) if at all possible.  Persistent access is the
200	   central duty of an ARK.  Persistent identification plays a vital
201	   supporting role but, strictly speaking, it can be construed as no
202	   more than a record attesting to the original assignment of a never-
203	   reassigned identifier.  Object access may not be feasible for various
204	   reasons, such as catastrophic loss of the object, a licensing
205	   agreement that keeps an archive "dark" for a period of years, or when
206	   an object's own lack of tangible existence confuses normal concepts
207	   of access (e.g., a vocabulary term might be accessed through its
208	   definition).  In such cases the ARK's identification role assumes a
209	   much higher profile.  But attempts to simplify the persistence
210	   problem by decoupling access from identification and concentrating
211	   exclusively on the latter are of questionable utility.  A perfect
212	   system for assigning forever unique identifiers might be created, but
213	   if it did so without reducing access failure rates, no one would be
214	   interested.  The central issue - which may be summed up as the "HTTP
215	   404 Not Found" problem - would not have been addressed.

217	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff

219	   An organization and the user community it serves can often be seen to
220	   struggle with two different areas of persistent identification: the
221	   Our Stuff problem and the Their Stuff problem.  In the Our Stuff
222	   problem, we in the organization want our own objects to acquire
223	   persistent names.  Since we possess or control these objects, our
224	   organization tackles the Our Stuff problem directly.  Whether or not
225	   the objects are named by ARKs, our organization is the responsible
226	   party, so it can plan for, maintain, and make commitments about the
227	   objects.

229	   In the Their Stuff problem, we in the organization want others'
230	   objects to acquire persistent names.  These are objects that we do
231	   not own or control, but some of which are critically important to us.
232	   But because they are beyond our influence as far as support is
233	   concerned, creating and maintaining persistent identifiers for Their
234	   Stuff is not especially purposeful or feasible for us to do.  There
235	   is little that we can do about someone else's stuff except encourage
236	   them to find or become providers of persistence services.

238	   Co-location of persistent access and identification services is
239	   natural.  Any organization that undertakes ongoing support of true
240	   persistent identification (which includes description) is well-served
241	   if it controls, owns, or otherwise has clear internal access to the
242	   identified objects, and this gives it an advantage if it wishes also
243	   to support persistent access to outsiders.  Conversely, persistent
244	   access to outsiders requires orderly internal collection management
245	   procedures that include monitoring, acquisition, verification, and
246	   change control over objects, which in turn requires object
247	   identifiers persistent enough to support auditable record keeping
248	   practices.

250	   Although organizing ARK services under one roof thus tends to make
251	   sense, object hosting can successfully be separated from name
252	   mapping.  An example is when a name mapping authority centrally
253	   provides uniform resolution services via a protocol gateway on behalf
254	   of organizations that host objects behind a variety of access
255	   protocols.  It is also reasonable to build value-added description
256	   services that rely on the underlying services of a set of mapping
257	   authorities.

259	   Supporting ARKs is not for every organization.  By requiring
260	   specific, revealed commitments to preservation, to object access, and
261	   to description, the bar for providing ARK services is higher than for
262	   some other identifier schemes.  On the other hand, it would be hard
263	   to grant credence to a persistence promise from an organization that
264	   could not muster the minimum ARK services.  There may still be a
265	   business model for an ARK-like, description-only service built on top
266	   of another organization's full complement of ARK services.  For
267	   example, there might be competition at the description level for
268	   abstracting and indexing a body of scientific literature archived in
269	   a combination of open and fee-based repositories.  The description-
270	   only service would have no direct commitment to the objects, but
271	   would act as an intermediary, forwarding commitment statements from
272	   object hosting services to requestors.

274	1.4.  Definition of Identifier

276	   An identifier is not a string of character data - an identifier is an
277	   association between a string of data and an object.  This abstraction
278	   is necessary because without it a string is just data.  It's nonsense
279	   to talk about a string's breaking, or about its being strong,
280	   maintained, and authentic.  But as a representative of an
281	   association, a string can do, metaphorically, the things that we
282	   expect of it.

284	   Without regard to whether an object is physical, digital, or
285	   conceptual, to identify it is to claim an association between it and
286	   a representative string, such as "Jane" or "ISBN 0596000278".  What
287	   gives a claim credibility is a set of verifiable assertions, or
288	   metadata, about the object, such as age, height, title, or number of
289	   pages.  In other words, the association is made manifest by a record
290	   (e.g., a cataloging or other metadata record) that vouches for it.

292	   In the complete absence of any testimony (metadata) regarding an
293	   association, a would-be identifier string is a meaningless sequence
294	   of characters.  To keep an externally visible but otherwise internal
295	   string from being perceived as an identifier by outsiders, for
296	   example, it suffices for an organization not to disclose the nature
297	   of its association.  For our immediate purpose, actual existence of
298	   an association record is more important than its authenticity or
299	   verifiability, which are outside the scope of this specification.

301	   It is a gift to the identification process if an object carries its
302	   own name as an inseparable part of itself, such as an identifier
303	   imprinted on the first page of a document or embedded in a data
304	   structure element of a digital document header.  In cases where the
305	   object is large, unwieldy, or unavailable (such as when licensing
306	   restrictions are in effect), a metadata record that includes the
307	   identifier string will usually suffice.  That record becomes a
308	   conveniently manipulable object surrogate, acting as both an
309	   association "receipt" and "declaration".

311	   Note that our definition of identifier extends the one in use for
312	   Uniform Resource Identifiers [URI].  The present document still
313	   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
314	   the string part of an identifier, but the context should make the
315	   meaning clear.

317	2.  ARK Anatomy

319	   An ARK is represented by a sequence of characters (a string) that
320	   contains the label, "ark:", optionally preceded by the beginning part
321	   of a URL.  Here is a diagrammed example.

323	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
324	         \___________________/ \__/ \___/ \______/ \____________/
325	           (replaceable)        |     |      |       Qualifier
326	                |         ARK Label   |      |    (NMA-supported)
327	                |                     |      |
328	      Name Mapping Authority          |    Name (NAA-assigned)
329	         Hostport (NMAH)              |
330	                           Name Assigning Authority Number (NAAN)

332	   The ARK syntax can be summarized,

334	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

336	   where the NMAH and Qualifier parts are in brackets to indicate that
337	   they are optional.
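
As a rough illustration of the summarized syntax, the following Python
sketch splits an ARK string into its parts.  It rests on two assumptions
that this section does not itself state: that the Name ends at the first
'/' or '.' character, with everything after it treated as the Qualifier,
and that any URL-style scheme may precede the NMAH.  The names ARK_RE,
Ark, and parse_ark are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# [scheme://NMAH/]ark:/NAAN/Name[Qualifier]
ARK_RE = re.compile(
    r"^(?:[A-Za-z][A-Za-z0-9+.-]*://(?P<nmah>[^/]+)/)?"  # optional NMAH
    r"ark:/(?P<naan>[0-9]{5}|[0-9]{9})/"                 # label and NAAN
    r"(?P<name>[^/.]+)"                                  # Name (assumed to stop at '/' or '.')
    r"(?P<qualifier>.*)$"                                # optional Qualifier
)


class Ark(NamedTuple):
    nmah: Optional[str]   # None when the ARK has no hostport
    naan: str
    name: str
    qualifier: str        # empty string when absent


def parse_ark(s: str) -> Ark:
    """Split an ARK string into NMAH, NAAN, Name, and Qualifier."""
    m = ARK_RE.match(s)
    if m is None:
        raise ValueError(f"not an ARK: {s!r}")
    return Ark(m.group("nmah"), m.group("naan"),
               m.group("name"), m.group("qualifier"))


print(parse_ark("http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff"))
print(parse_ark("ark:/12025/654xz321"))
```

Applied to the diagrammed example, the sketch yields the hostport
foobar.zaf.org, the NAAN 12025, the Name 654xz321, and the Qualifier
/s3/f8.05v.tiff.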

339	2.1.  The Name Mapping Authority Hostport (NMAH)

341	   Before the "ark:" label may appear an optional Name Mapping Authority
342	   Hostport (NMAH) that is a temporary address where ARK service
343	   requests may be sent.  It consists of "http://" (or any service
344	   specification valid for a URL) followed by an Internet hostname or
345	   hostport combination having the same format and semantics as the
346	   hostport part of a URL.  The most important thing about the NMAH is
347	   that it is "identity inert" from the point of view of object
348	   identification.  In other words, ARKs that differ only in the
349	   optional NMAH part identify the same object.  Thus, for example, the
350	   following three ARKs are synonyms for just one information object:

352	                      http://loc.gov/ark:/12025/654xz321
353	                  http://rutgers.edu/ark:/12025/654xz321
354	                                     ark:/12025/654xz321
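
One way to test whether two ARK strings are synonyms in this sense is to
discard everything before the "ark:" label and compare what remains.  The
Python sketch below assumes that the first occurrence of "ark:/" marks
the start of the identity-bearing part; the function name ark_core is
hypothetical.

```python
def ark_core(s: str) -> str:
    """Return the ARK with any leading NMAH part removed."""
    i = s.find("ark:/")
    if i < 0:
        raise ValueError(f"no 'ark:/' label in {s!r}")
    return s[i:]


synonyms = [
    "http://loc.gov/ark:/12025/654xz321",
    "http://rutgers.edu/ark:/12025/654xz321",
    "ark:/12025/654xz321",
]

# All three strings reduce to the same core, so they name one object.
cores = {ark_core(s) for s in synonyms}
print(cores)
```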

356	   Strictly speaking, in the realm of digital objects, these ARKs may
357	   lead over time to somewhat different or diverging instances of the
358	   originally named object.  In an ideal world, divergence of persistent
359	   objects is not desirable, but it is widely believed that digital
360	   preservation efforts will inevitably lead to alterations in some
361	   original objects (e.g., a format migration in order to preserve the
362	   ability to display a document).  If any of those objects are held
363	   redundantly in more than one organization (a common preservation
364	   strategy), chances are small that all holding organizations will
365	   perform the same precise transformations and all maintain the same
366	   object metadata.  More significant divergence would be expected when
367	   the holding organizations serve different audiences or compete with
368	   each other.

370	   The NMAH part makes an ARK into an actionable URL.  As with many
371	   internet parameters, it is helpful to approach the NMAH being liberal
372	   in what you accept and conservative in what you propose.  From the
373	   recipient's point of view, the NMAH part should be treated as
374	   temporary, disposable, and replaceable.  From the NMA's point of
375	   view, it should be chosen with the greatest concern for longevity.  A
376	   carefully chosen NMAH should be at least as permanent as the
377	   providing organization's own hostname.  In the case of a national or
378	   university library, for example, there is no reason why the NMAH
379	   should not be considerably more permanent than soft-funded proxy
380	   hostnames such as hdl.handle.net, dx.doi.org, and purl.org.  In
381	   general and over time, however, it is not unexpected for an NMAH
382	   eventually to stop working and require replacement with the NMAH of a
383	   currently active service provider.

385	   This replacement relies on a mapping authority "resolver" discovery
386	   process, of which two alternate methods are outlined in a later
387	   section.  The ARK, URN, Handle, and DOI schemes all use a resolver
388	   discovery model that sooner or later requires matching the original
389	   assigning authority with a current provider servicing that
390	   authority's named objects; once found, the resolver at that provider
391	   performs what amounts to a redirect to a place where the object is
392	   currently held.  All the schemes rely on the ongoing functionality of
393	   currently mainstream technologies such as the Domain Name System
394	   [DNS] and web browsers.  The Handle and DOI schemes in addition
395	   require that the Handle protocol layer and global server grid be
396	   available at all times.

398	   The practice of prepending "http://" and an NMAH to an ARK is a way
399	   of creating an actionable identifier by a method that is itself
400	   temporary.  Assuming that infrastructure supporting [HTTP]
401	   information retrieval will no longer be available one day, ARKs will
402	   then have to be converted into new kinds of actionable identifiers.
403	   By that time, if ARKs see widespread use, web browsers would
404	   presumably evolve to perform this (currently simple) transformation
405	   automatically.
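
The transformation mentioned above is simple enough to sketch in Python.
The function name make_actionable, its parameters, and the host
example.org are assumptions made for illustration.  Any existing NMAH is
dropped first, since the recipient is meant to treat it as replaceable.

```python
def make_actionable(ark: str, nmah: str, scheme: str = "http") -> str:
    """Prepend a service scheme and an NMAH to the core of an ARK."""
    i = ark.find("ark:/")
    if i < 0:
        raise ValueError(f"no 'ark:/' label in {ark!r}")
    # Keep only the part starting at the label, then attach the new host.
    return f"{scheme}://{nmah}/{ark[i:]}"


# A bare ARK becomes a URL at a placeholder host.
print(make_actionable("ark:/12025/654xz321", "example.org"))

# An ARK whose NMAH has stopped working is pointed at another provider.
print(make_actionable("http://loc.gov/ark:/12025/654xz321", "rutgers.edu"))
```

The scheme is a parameter so that the same sketch would still apply if a
service other than HTTP were used to reach the mapping authority.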

407	2.2.  The ARK Label Part - ark:

409	   The label part distinguishes an ARK from an ordinary identifier.  In
410	   a URL found in the wild, the string, "ark:/", indicates that the URL
411	   stands a reasonable chance of being an ARK.  If the context warrants,
412	   verification that it actually is an ARK can be done by testing it for
413	   existence of the three ARK services.

415	   Since nothing about an identifier syntax directly affects
416	   persistence, the "ark:" label (like "urn:", "doi:", and "hdl:")
417	   cannot tell you whether the identifier is persistent or whether the
418	   object is available.  It does tell you that the original Name
419	   Assigning Authority (NAA) had some sort of hopes for it, but it
420	   doesn't tell you whether that NAA is still in existence, or whether a
421	   decade ago it ceased to have any responsibility for providing
422	   persistence, or whether it ever had any responsibility beyond naming.

424	   Only a current provider can say for certain what sort of commitment
425	   it intends, and the ARK label suggests that you can query the NMAH
426	   directly to find out exactly what kind of persistence is promised.
427	   Even if what is promised is impersistence (i.e., a short-term
428	   identifier), saying so is valuable information to the recipient.
429	   Thus an ARK is a high-functioning identifier in the sense that it
430	   provides access to the object, the metadata, and a commitment
431	   statement, even if the commitment is explicitly very weak.

433	2.3.  The Name Assigning Authority Number (NAAN)

435	   Recalling that the general form of the ARK is,

437	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

439	   the part of the ARK directly following the "ark:" is the Name
440	   Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
441	   This part is always required, as it identifies the organization that
442	   originally assigned the Name of the object.  It is used to discover a
443	   currently valid NMAH and to provide top-level partitioning of the
444	   space of all ARKs.  NAANs are registered in a manner similar to URN
445	   Namespaces, but they are pure numbers consisting of 5 digits or 9
446	   digits.  Thus, the first 100,000 registered NAAs fit compactly into
447	   the 5 digits, and if growth warrants, the next billion fit into the 9
448	   digit form.  In either case the fixed odd number of digits helps
449	   reduce the chances of finding a NAAN out of context and confusing it
450	   with nearby quantities such as 4-digit dates.

452	   The NAAN designates a top-level ARK namespace.  Once registered for a
453	   namespace, a NAAN is never re-registered.  It is possible, however,
454	   for there to be a succession of organizations that manage an ARK
455	   namespace.
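
   As an illustration of the NAAN form described above, the following
   is a minimal sketch of a syntax check.  The function name
   is_valid_naan is hypothetical; the check covers only the "5 digits
   or 9 digits" rule and says nothing about whether a NAAN is actually
   registered.

```python
import re

# Assumption drawn from the text: a NAAN is exactly 5 or exactly 9 ASCII digits.
NAAN_RE = re.compile(r"[0-9]{5}|[0-9]{9}")


def is_valid_naan(naan: str) -> bool:
    """Return True if the string has the syntactic form of a NAAN."""
    return NAAN_RE.fullmatch(naan) is not None
```

   A 4-digit string such as "1998" is rejected, which reflects the
   stated purpose of the odd digit counts.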

457	2.4.  The Name Part

459	   The part of the ARK just after the NAAN is the Name assigned by the
460	   NAA, and it is also required.  Semantic opaqueness in the Name part
461	   is strongly encouraged in order to reduce an ARK's vulnerability to
462	   era- and language-specific change.  Identifier strings containing
463	   linguistic fragments can create support difficulties down the road.
464	   No matter how appropriate or even meaningless they are today, such
465	   fragments may one day create confusion, give offense, or infringe on
466	   a trademark as the semantic environment around us and our communities
467	   evolves.

469	   Names that look more or less like numbers avoid common problems that
470	   defeat persistence and international acceptance.  The use of digits
471	   is highly recommended.  Mixing in non-vowel alphabetic characters a
472	   couple at a time is a relatively safe and easy way to achieve a
473	   denser namespace (more possible names for a given length of the name
474	   string).  Such names have a chance of aging and traveling well.
475	   Tools exist that mint, bind, and resolve opaque identifiers, with or
476	   without check characters [NOID].  More on naming considerations is
477	   given in a subsequent section.

479	2.5.  The Qualifier Part

481	   The part of the ARK following the NAA-assigned Name is an optional
482	   Qualifier.  It is a string that extends the base ARK in order to
483	   create a kind of service entry point into the object named by the
484	   NAA.  At the discretion of the providing NMA, such a service entry
485	   point permits an ARK to support access to individual hierarchical
486	   components and subcomponents of an object, and to variants (versions,
487	   languages, formats) of components.  A Qualifier may be invented by
488	   the NAA or by any NMA servicing the object.

490	   In form, the Qualifier is a ComponentPath, or a VariantPath, or a
491	   ComponentPath followed by a VariantPath.  A VariantPath is introduced
492	   and subdivided by the reserved character `.', and a ComponentPath is
493	   introduced and subdivided by the reserved character `/'.  In this
494	   example,

496	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff

498	   the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is
499	   a VariantPath.  The ARK Qualifier is a formalization of some
500	   currently mainstream URL syntax conventions.  This formalization
501	   specifically reserves meanings that permit recipients to make strong
502	   inferences about logical sub-object containment and equivalence based
503	   only on the form of the received identifiers; there is great
504	   efficiency in not having to inspect metadata records to discover such
505	   relationships.  NMAs are free not to disclose any of these
506	   relationships merely by avoiding the reserved characters above.
507	   Hierarchical components and variants are discussed further in the
508	   next two sections.
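
   The decomposition described above can be sketched as follows.  This
   is an illustrative parser, not a normative grammar: the names
   ArkParts and split_ark are hypothetical, and the sketch assumes an
   ARK in which every ComponentPath element precedes every VariantPath
   element.

```python
import re
from typing import NamedTuple, Optional


class ArkParts(NamedTuple):
    nmah: Optional[str]   # optional host part, None when absent
    naan: str
    name: str
    component_path: str   # introduced and subdivided by '/'
    variant_path: str     # introduced and subdivided by '.'


# General form from the text: [http://NMAH/]ark:/NAAN/Name[Qualifier]
ARK_RE = re.compile(
    r"(?:http://(?P<nmah>[^/]+)/)?"
    r"ark:/(?P<naan>[0-9]+)/"
    r"(?P<name>[^/.]+)"
    r"(?P<component>(?:/[^/.]+)*)"
    r"(?P<variant>(?:\.[^/.]+)*)"
)


def split_ark(ark: str) -> ArkParts:
    """Split an ARK into NMAH, NAAN, Name, ComponentPath, and VariantPath."""
    m = ARK_RE.fullmatch(ark)
    if m is None:
        raise ValueError("string does not match the assumed ARK form")
    return ArkParts(m.group("nmah"), m.group("naan"), m.group("name"),
                    m.group("component"), m.group("variant"))
```

   Applied to the example ARK above, the sketch yields the Name
   "654xz321", the ComponentPath "/s3/f8", and the VariantPath
   ".05v.tiff".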

510	   The Qualifier, if present, differs from the Name in several important
511	   respects.  First, a Qualifier may have been assigned either by the
512	   NAA or later by the NMA.  The assignment of a Qualifier by an NMA
513	   effectively amounts to an act of publishing a service entry point
514	   within the conceptual object originally named by the NAA.  For our
515	   purposes, an ARK extended with a Qualifier assigned by an NMA will be
516	   called an NMA-qualified ARK.

518	   Second, a Qualifier assignment on the part of an NMA is made in
519	   fulfillment of its service obligations and may reflect changing
520	   service expectations and technology requirements.  NMA-qualified ARKs
521	   could therefore be transient, even if the base, unqualified ARK is
522	   persistent.  For example, it would be reasonable for an NMA to
523	   support access to an image object through an actionable ARK that is
524	   considered persistent even if the experience of that access changes
525	   as linking, labeling, and presentation conventions evolve and as
526	   format and security standards are updated.  For an image "thumbnail",
527	   that NMA could also support an NMA-qualified ARK that is considered
528	   impersistent because the thumbnail will be replaced with higher
529	   resolution images as network bandwidth and CPU speeds increase.  At
530	   the same time, for an originally scanned, high-resolution master, the
531	   NMA could publish an NMA-qualified ARK that is itself considered
532	   persistent.  Of course, the NMA must be able to return its separate
533	   commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs,
534	   and to any NAA-qualified ARKs that it supports.

536	   A third difference between a Qualifier and a Name concerns the
537	   semantic opaqueness constraint.  When an NMA-qualified ARK is to be
538	   used as a transient service entry point into a persistent object, the
539	   priority given to semantic opaqueness observed by the NAA in the Name
540	   part may be relaxed by the NMA in the Qualifier part.  If service
541	   priorities in the Qualifier take precedence over persistence, short-
542	   term usability considerations may recommend somewhat semantically
543	   laden Qualifier strings.

545	   Finally, not only is the set of Qualifiers supported by an NMA
546	   mutable, but different NMAs may support different Qualifier sets for
547	   the same NAA-identified object.  In this regard the NMAs act
548	   independently of each other and of the NAA.

550	   The next two sections describe how ARK syntax may be used to declare,
551	   or to avoid declaring, certain kinds of relatedness among qualified
552	   ARKs.

554	2.5.1.  ARKs that Reveal Object Hierarchy

556	   An NAA or NMA may choose to reveal the presence of a hierarchical
557	   relationship between objects using the `/' (slash) character after
558	   the Name part of an ARK.  Some authorities will choose not to
559	   disclose this information, while others will go ahead and disclose so
560	   that manipulators of large sets of ARKs can infer object
561	   relationships by simple identifier inspection; for example, this
562	   makes it possible for a system to present a collapsed view of a large
563	   search result set.

565	   If the ARK contains an internal slash after the NAAN, the piece to
566	   its left indicates a containing object.  For example, publishing an
567	   ARK of the form,

569	                         ark:/12025/654/xz/321

571	   is equivalent to publishing three ARKs,

573	                         ark:/12025/654/xz/321
574	                         ark:/12025/654/xz
575	                         ark:/12025/654

577	   together with a declaration that the first object is contained in the
578	   second object, and that the second object is contained in the third.

580	   Revealing the presence of hierarchy is completely up to the assigner
581	   (NMA or NAA).  It is hard enough to commit to one object's name, let
582	   alone to three objects' names and to a specific, ongoing relatedness
583	   among them.  Thus, regardless of whether hierarchy was present
584	   initially, the assigner, by not using slashes, reveals no shared
585	   inferences about hierarchical or other inter-relatedness in the
586	   following ARKs:

588	                         ark:/12025/654_xz_321
589	                         ark:/12025/654_xz
590	                         ark:/12025/654xz321
591	                         ark:/12025/654xz
592	                         ark:/12025/654

594	   Note that slashes around the ARK's NAAN (/12025/ in these examples)
595	   are not part of the ARK's Name and therefore do not indicate the
596	   existence of some sort of NAAN super object containing all objects in
597	   its namespace.  A slash must have at least one non-structural
598	   character (one that is neither a slash nor a period) on both sides in
599	   order for it to separate recognizable structural components.  So
600	   initial or final slashes may be removed, and double slashes may be
601	   converted into single slashes.
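
   The containment inference described in this section can be sketched
   as below.  The function name implied_containers is hypothetical, and
   the sketch assumes an ARK with no NMAH and no variant suffixes.

```python
from typing import List

LABEL = "ark:/"


def implied_containers(ark: str) -> List[str]:
    """Return the ARK followed by the ARK of each containing object."""
    if not ark.startswith(LABEL):
        raise ValueError("expected a string beginning with 'ark:/'")
    naan, _, rest = ark[len(LABEL):].partition("/")
    # Empty pieces come from initial, final, or doubled slashes, which
    # do not separate recognizable components, so they are dropped.
    parts = [p for p in rest.split("/") if p]
    return [LABEL + naan + "/" + "/".join(parts[:n])
            for n in range(len(parts), 0, -1)]
```

   For "ark:/12025/654/xz/321" the result lists the three ARKs shown
   above, innermost first; for "ark:/12025/654_xz_321" it lists only
   that ARK, since no slash follows the Name.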

603	2.5.2.  ARKs that Reveal Object Variants

605	   An NAA or NMA may choose to reveal the possible presence of variant
606	   objects or object components using the `.' (period) character after
607	   the Name part of an ARK.  Some authorities will choose not to
608	   disclose this information, while others will go ahead and disclose so
609	   that manipulators of large sets of ARKs can infer object
610	   relationships by simple identifier inspection; for example, this
611	   makes it possible for a system to present a collapsed view of a large
612	   search result set.

614	   If the ARK contains an internal period after the Name, the piece to
615	   its left is a base name and the piece to its right, up to the end of
616	   the ARK or to the next period, is a suffix.  A Name may have more
617	   than one suffix, for example,
618	                         ark:/12025/654.24
619	                         ark:/12025/xz4/654.24
620	                         ark:/12025/654.20v.78g.f55

622	   There are two main rules.  First, if two ARKs share the same base
623	   name but have different suffixes, the corresponding objects were
624	   considered variants of each other (different formats, languages,
625	   versions, etc.) by the assigner (NMA or NAA).  Thus, the following
626	   ARKs are variants of each other:

628	                         ark:/12025/654.20v.78g.f55
629	                         ark:/12025/654.321xz
630	                         ark:/12025/654.44

632	   Second, publishing an ARK with a suffix implies the existence of at
633	   least one variant identified by the ARK without its suffix.  The ARK
634	   otherwise permits no further assumptions about what variants might
635	   exist.  So publishing the ARK,

637	                         ark:/12025/654.20v.78g.f55

639	   is equivalent to publishing the four ARKs,

641	                         ark:/12025/654.20v.78g.f55
642	                         ark:/12025/654.20v.78g
643	                         ark:/12025/654.20v
644	                         ark:/12025/654

646	   Revealing the possibility of variants is completely up to the
647	   assigner.  It is hard enough to commit to one object's name, let
648	   alone to multiple variants' names and to a specific, ongoing
649	   relatedness among them.  The assigner is the sole arbiter of what
650	   constitutes a variant within its namespace, and whether to reveal
651	   that kind of relatedness by using periods within its names.

653	   A period must have at least one non-structural character (one that is
654	   neither a slash nor a period) on both sides in order for it to
655	   separate recognizable structural components.  So initial or final
656	   periods may be removed, and adjacent periods may be converted into a
657	   single period.  Multiple suffixes should be arranged in sorted order
658	   (pure ASCII collating sequence) at the end of an ARK.
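
   The second rule above (a suffixed ARK implies the ARKs obtained by
   removing trailing suffixes) can be sketched as follows.  The
   function name implied_variants is hypothetical, and the sketch
   assumes an ARK without an NMAH, since hostnames also contain
   periods.

```python
from typing import List


def implied_variants(ark: str) -> List[str]:
    """Return the ARK plus each ARK implied by dropping trailing suffixes."""
    base, *suffixes = ark.split(".")
    # Empty pieces come from initial, final, or adjacent periods.
    suffixes = [s for s in suffixes if s]
    return [".".join([base] + suffixes[:n])
            for n in range(len(suffixes), -1, -1)]
```

   For "ark:/12025/654.20v.78g.f55" this produces the four ARKs listed
   above, ending with the base name "ark:/12025/654".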

660	2.6.  Character Repertoires

662	   The Name and Qualifier parts are strings of visible ASCII characters
663	   and should be less than 128 bytes in length.  The length restriction
664	   keeps the ARK short enough to append ordinary ARK request strings
665	   without running into transport restrictions (e.g., within HTTP GET
666	   requests).  Characters may be letters, digits, or any of these seven
667	   characters:

669	         =   #   *   +   @   _   $

671	   The following characters may also be used, but their meanings are
672	   reserved:

674	         %   -   .   /

676	   The characters `/' and `.' are ignored if either appears as the last
677	   character of an ARK.  If used internally, they allow a name assigner
678	   to reveal object hierarchy and object variants as previously
679	   described.

681	   Hyphens are considered to be insignificant and are always ignored in
682	   ARKs.  A `-' (hyphen) may appear in an ARK for readability, or it may
683	   have crept in during the formatting and wrapping of text, but it must
684	   be ignored in lexical comparisons.  As in a telephone number, hyphens
685	   have no meaning in an ARK.  It is always safe for an NMA that
686	   receives an ARK to remove any hyphens found in it.  As a result, like
687	   the NMAH, hyphens are "identity inert" in comparing ARKs for
688	   equivalence.  For example, the following ARKs are equivalent for
689	   purposes of comparison and ARK service access:

691	                                 ark:/12025/65-4-xz-321
692	         http://sneezy.dopey.com/ark:/12025/654--xz32-1
693	                                 ark:/12025/654xz321

695	   The `%' character is reserved for %-encoding all other octets that
696	   would appear in the ARK string, in the same manner as for URIs [URI].
697	   A %-encoded octet consists of a `%' followed by two hex digits; for
698	   example, "%7d" stands in for `}'.  Lower case hex digits are
699	   preferred to reduce the chances of false acronym recognition; thus it
700	   is better to use "%acT" instead of "%ACT".  The character `%' itself
701	   must be represented using "%25".  As with URNs, %-encoding permits
702	   ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) that have
703	   less restricted character repertoires [URNBIB].
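
   A minimal sketch of the %-encoding rule follows.  The function name
   encode_for_ark is hypothetical.  The sketch assumes the input is
   encoded as UTF-8 before octets are examined, and it leaves the
   reserved characters `-', `.', and `/' untouched, so a caller that
   wants to conceal their reserved meanings would have to encode them
   separately.

```python
import string

# Characters usable directly, per the repertoire given in the text.
PLAIN = set(string.ascii_letters + string.digits + "=#*+@_$")
# Reserved characters that are passed through unchanged in this sketch.
RESERVED_PASSTHROUGH = set("-./")


def encode_for_ark(text: str) -> str:
    """%-encode every octet outside the ARK repertoire, using lower case hex."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        if ch in PLAIN or ch in RESERVED_PASSTHROUGH:
            out.append(ch)
        else:
            # Covers '%' itself (which becomes "%25") and all other octets.
            out.append("%%%02x" % octet)
    return "".join(out)
```

   Under these assumptions "a}b" becomes "a%7db" and a literal `%'
   becomes "%25".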

705	2.7.  Normalization and Lexical Equivalence

707	   To determine if two or more ARKs identify the same object, the ARKs
708	   are compared for lexical equivalence after first being normalized.
709	   Since ARK strings may appear in various forms (e.g., having different
710	   NMAHs), normalizing them minimizes the chances that comparing two ARK
711	   strings for equality will fail unless they actually identify
712	   different objects.  In a specified-host ARK (one having an NMAH), the
713	   NMAH never participates in such comparisons.

715	   Normalization of an ARK for the purpose of octet-by-octet equality
716	   comparison with another ARK consists of four steps.  First, any upper
717	   case letters in the "ark:" label and the two characters following a
718	   `%' are converted to lower case.  The case of all other letters in
719	   the ARK string must be preserved.  Second, any NMAH part is removed
720	   (everything from an initial "http://" up to the next slash) and all
721	   hyphens are removed.

723	   Third, structural characters (slash and period) are normalized.
724	   Initial and final occurrences are removed, and two structural
725	   characters in a row (e.g., // or ./) are replaced by the first
726	   character, iterating until each occurrence has at least one
727	   non-structural character on both sides.  Also, if there are any
728	   components with a period on the left and a slash on the right, either
729	   the component and the preceding period must be moved to the end of
730	   the Name part or the ARK must be thrown out as malformed.

732	   The fourth and final step is to arrange the suffixes in ASCII
733	   collating sequence (that is, to sort them) and to remove duplicate
734	   suffixes, if any.  It is also permissible to throw out ARKs for which
735	   the suffixes are not sorted.

737	   The resulting ARK string is now normalized.  Comparisons between
738	   normalized ARKs are case-sensitive, meaning that upper case letters
739	   are considered different from their lower case counterparts.

741	   To keep ARK string variation to a minimum, no reserved ARK characters
742	   should be %-encoded unless it is deliberately to conceal their
743	   reserved meanings.  No non-reserved ARK characters should ever be
744	   %-encoded.  Finally, no %-encoded character should ever appear in an
745	   ARK in its decoded form.
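
   The four normalization steps can be sketched as follows.  This is an
   illustration under stated assumptions rather than a reference
   implementation: the function name normalize_ark is hypothetical, the
   structural rules are applied only to the part after the NAAN, the
   "http://" prefix is matched without regard to case, and an ARK with
   a suffix followed by a slash is rejected rather than repaired (the
   text permits either).

```python
import re


def normalize_ark(ark: str) -> str:
    """Normalize an ARK for octet-by-octet comparison (illustrative sketch)."""
    # Step 2, first half: remove any NMAH part.
    s = re.sub(r"^http://[^/]*/", "", ark, flags=re.IGNORECASE)

    # Step 1: lower-case the label and the two hex digits after each '%'.
    label = re.match(r"ark:/", s, flags=re.IGNORECASE)
    if label is None:
        raise ValueError("no 'ark:/' label found")
    body = s[label.end():]
    body = re.sub(r"%([0-9A-Fa-f]{2})",
                  lambda m: "%" + m.group(1).lower(), body)

    # Step 2, second half: remove all hyphens.
    body = body.replace("-", "")

    # Assumption: the slashes around the NAAN are kept as they are.
    naan, _, rest = body.partition("/")

    # Step 3: collapse each run of structural characters to its first
    # character, then drop initial and final occurrences.
    rest = re.sub(r"([/.])[/.]+", r"\1", rest).strip("/.")
    if re.search(r"\.[^/.]+/", rest):
        raise ValueError("suffix followed by a slash; treated as malformed")

    # Step 4: sort the suffixes (code point order equals ASCII order
    # for ASCII strings) and remove duplicates.
    base, *suffixes = rest.split(".")
    return "ark:/" + naan + "/" + ".".join([base] + sorted(set(suffixes)))
```

   Two ARKs are then compared with an ordinary case-sensitive string
   comparison of their normalized forms.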

747	3.  Naming Considerations

749	   The most important threats faced by persistence providers include
750	   such things as funding loss, natural disaster, political and social
751	   upheaval, processing faults, and errors in human oversight.  There is
752	   nothing that an identifier scheme can do about such things.  Still, a
753	   few observed identifier failures and inconveniences can be traced
754	   back to naming practices that we now know to be less than optimal for
755	   persistence.

757	3.1.  ARKs Embedded in Language

759	   The ARK has different goals from the URI, so it has different
760	   character set requirements.  Because linguistic constructs imperil
761	   persistence, for ARKs non-ASCII character support is unimportant.
762	   ARKs and URIs share goals of transcribability and transportability
763	   within web documents, so characters are required to be visible, non-
764	   conflicting with HTML/XML syntax, and not subject to tampering during
765	   transmission across common transport gateways.  Add the goal of
766	   making an undelimited ARK recognizable in running prose, as in
767	   ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
768	   period) end up being excluded from the ARK lest the end of a phrase
769	   or sentence be mistaken for part of the ARK.

771	   This consideration has more direct effect on ARK usability in a
772	   natural language context than it has on ARK persistence.  The same is
773	   true of the rule preventing hyphens from having lexical significance.
774	   It is fine to publish ARKs with hyphens in them (e.g., the output of
775	   UUID/GUID generators), but the uniform treatment of hyphens
776	   as insignificant reduces the possibility of users transcribing
777	   identifiers that will have been broken through unpredictable
778	   hyphenation by word processors.  Any measure that reduces user
779	   irritation with an identifier will increase its chances of survival.

781	3.2.  Objects Should Wear Their Identifiers

783	   A valuable technique for provision of persistent objects is to try to
784	   arrange for the complete identifier to appear on, with, or near its
785	   retrieved object.  An object encountered at a moment in time when its
786	   discovery context has long since disappeared could then easily be
787	   traced back to its metadata, to alternate versions, to updates, etc.
788	   This has seen reasonable success, for example, in book publishing and
789	   software distribution.  An identifier string only has meaning when
790	   its association is known, and this is a very sure, simple, and low-tech
791	   method of reminding everyone exactly what that association is.

793	3.3.  Names are Political, not Technological

795	   If persistence is the goal, a deliberate local strategy for
796	   systematic name assignment is crucial.  Names must be chosen with
797	   great care.  Poorly chosen and managed names will devastate any
798	   persistence strategy, and they do not discriminate by identifier
799	   scheme.  Whether a mistakenly re-assigned name is a URN, DOI, PURL,
800	   URL, or ARK, the damage - failed access and confusion - is not
801	   mitigated more in one scheme than in another.  Conversely, in-house
802	   efforts to manage names responsibly will go much further towards
803	   safeguarding persistence than any choice of naming scheme or name
804	   resolution technology.

806	   Branding (e.g., at the corporate or departmental level) is important
807	   for funding and visibility, but substrings representing brands and
808	   organizational names should be given a wide berth except when
809	   absolutely necessary in the hostname (the identity-inert) part of the
810	   ARK.  These substrings are not only unstable because organizations
811	   change frequently, but they are also dangerous because successor
812	   organizations often have political or legal reasons to actively
813	   suppress predecessor names and brands.  Any measure that reduces the
814	   chances of future political or legal pressure on an identifier will
815	   decrease the chances that our descendants will be obliged to
816	   deliberately break it.

818	3.4.  Choosing a Hostname or NMA

820	   Hostnames appearing in any identifier meant to be persistent must be
821	   chosen with extra care.  The tendency in hostname selection has
822	   traditionally been to choose a token with recognizable attributes,
823	   such as a corporate brand, but that tendency wreaks havoc with
824	   persistence that is supposed to outlive brands, corporations, subject
825	   classifications, and natural language semantics (e.g., what did the
826	   three letters "gay" mean in 1958, 1978, and 1998?).  Today's
827	   recognized and correct attributes are tomorrow's stale or incorrect
828	   attributes.  In making hostnames (any names, actually) long-term
829	   persistent, it helps to eliminate recognizable attributes to the
830	   extent possible.  This affects selection of any name based on URLs,
831	   including PURLs and the explicitly disposable NMAHs.

833	   There is no excuse for a provider that manages its internal names
834	   impeccably not to exercise the same care in choosing what could be an
835	   exceptionally durable hostname, especially if it would form the
836	   prefix for all the provider's URL-based external names.  Registering
837	   an opaque hostname in the ".org" or ".net" domain would not be a bad
838	   start.  Another way is to publish your ARKs with an organizational
839	   domain name that will be mapped by DNS to an appropriate NMA host.
840	   This makes for shorter names with less branding vulnerability.

842	   It is a mistake to think that hostnames are inherently unstable.  If
843	   you require brand visibility, that may be a fact of life.  But things
844	   are easier if yours is the brand of a long-lived cultural memory
845	   institution such as a national or university library or archive.
846	   Well-chosen hostnames from organizations that are sheltered from the
847	   direct effects of a volatile marketplace can easily provide longer-
848	   lived global resolvers than the domain names explicitly or implicitly
849	   used as starting points for global resolution by indirection-based
850	   persistent identifier schemes.  For example, it is hard to imagine
851	   circumstances under which the Library of Congress' domain name would
852	   disappear sooner than, say, "handle.net".

854	   For smaller libraries, archives, and preservation organizations,
855	   there is a natural concern about whether they will be able to keep
856	   their web servers and domain names in the face of uncertain funding.
857	   One option is to form or join a consortium [N2T] of like-minded
858	   organizations with the purpose of providing mutual preservation
859	   support.  The first goal of such a consortium would be to perpetually
860	   rent a hostname on which to establish a web server that simply
861	   redirects incoming member organization requests to the appropriate
862	   member server; using ARKs, for example, a 150-member consortium could
863	   run a very small server (24x7) that contained nothing more than 150
864	   rewrite rules in its configuration file.  Even more helpful would be
865	   additional consortial support for a member organization that was
866	   unable to continue providing services and needed to find a successor
867	   archival organization.  This would be a low-cost, low-tech way to
868	   publish ARKs (or URLs) under highly persistent hostnames.
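
   The redirecting consortium server described above amounts to a small
   lookup table keyed by NAAN.  The sketch below illustrates the idea.
   The member hostnames and the names MEMBER_SERVERS and
   redirect_location are invented for this example, and a real
   deployment might instead express the same table as web server
   rewrite rules.

```python
import re
from typing import Optional

# Hypothetical member table: one entry per member organization's NAAN.
MEMBER_SERVERS = {
    "12025": "http://member-a.example.org",
    "28722": "http://member-b.example.edu",
}

# Request path of the assumed form /ark:/NAAN/Name[Qualifier]
ARK_PATH = re.compile(r"/?(ark:/([0-9]+)/.+)")


def redirect_location(request_path: str) -> Optional[str]:
    """Return the member URL to redirect to, or None if no rule applies."""
    m = ARK_PATH.fullmatch(request_path)
    if m is None:
        return None
    server = MEMBER_SERVERS.get(m.group(2))
    if server is None:
        return None
    return server + "/" + m.group(1)
```

   If a member later hands its collection to a successor, only the
   single table entry for that member's NAAN needs to change.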

870	   There are no obvious reasons why the organizations registering DNS
871	   names, URN Namespaces, and DOI publisher IDs should have among them
872	   one that is intrinsically more fallible than the next.  Moreover, it
873	   is a misconception that the demise of DNS and of HTTP need adversely
874	   affect the persistence of URLs.  At such a time, certainly URLs from
875	   the present day might not then be actionable by our present-day
876	   mechanisms, but resolution systems for future non-actionable URLs are
877	   no harder to imagine than resolution systems for present-day non-
878	   actionable URNs and DOIs.  There is no more stable a namespace than
879	   one that is dead and frozen, and that would then characterize the
880	   space of names bearing the "http://" prefix.  It is useful to
881	   remember that just because hostnames have been carelessly chosen in
882	   their brief history does not mean that they are unsuitable in NMAHs
883	   (and URLs) intended for use in situations demanding the highest level
884	   of persistence available in the Internet environment.  A well-planned
885	   name assignment strategy is everything.

887	3.5.  Assigners of ARKs

889	   A Name Assigning Authority (NAA) is an organization that creates (or
890	   delegates creation of) long-term associations between identifiers and
891	   information objects.  Examples of NAAs include national libraries,
892	   national archives, and publishers.  An NAA may arrange with an
893	   external organization for identifier assignment.  The US Library of
894	   Congress, for example, allows OCLC (the Online Computer Library
895	   Center, a major world cataloger of books) to create associations
896	   between Library of Congress call numbers (LCCNs) and the books that
897	   OCLC processes.  A cataloging record is generated that testifies to
898	   each association, and the identifier is included by the publisher,
899	   for example, in the front matter of a book.

901	   An NAA does not so much create an identifier as create an
902	   association.  The NAA first draws an unused identifier string from
903	   its namespace, which is the set of all identifiers under its control.
904	   It then records the assignment of the identifier to an information
905	   object having sundry witnessed characteristics, such as a particular
906	   author and modification date.  A namespace is usually reserved for an
907	   NAA by agreement with recognized community organizations (such as
908	   IANA and ISO) that all names containing a particular string be under
909	   its control.  In the ARK an NAA is represented by the Name Assigning
910	   Authority Number (NAAN).

912	   The ARK namespace reserved for an NAA is the set of names bearing its
913	   particular NAAN.  For example, all strings beginning with
914	   "ark:/12025/" are under control of the NAA registered under 12025,
915	   which might be the National Library of Finland.  Because each NAA has
916	   a different NAAN, names from one namespace cannot conflict with those
917	   from another.  Each NAA is free to assign names from its namespace
918	   (or delegate assignment) according to its own policies.  These
919	   policies must be documented in a manner similar to the declarations
920	   required for URN Namespace registration [URNNID].

922	   To register for a NAAN, please read about the mapping authority
923	   discovery file in the next section and send email to ark@cdlib.org.

925	3.6.  NAAN Namespace Management

927	   Every NAA must have a namespace management strategy.  A time-honored
928	   technique is to hierarchically partition a namespace into
929	   subnamespaces using prefixes that guarantee non-collision of names in
930	   different partitions.  This practice is strongly encouraged for all
931	   NAAs, especially when subnamespace management will be delegated to
932	   other departments, units, or projects within an organization.  For
933	   example, with a NAAN that is assigned to a university and managed by
934	   its main library, care should be taken to reserve semantically opaque
935	   prefixes that will set aside large parts of the unused namespace for
936	   future assignments.  Prefix-based partition management is an
937	   important responsibility of the NAA.

939	   This sort of delegation by prefix is well-used in the formation of
940	   DNS names and ISBN identifiers.  An important difference is that in
941	   the former, the hierarchy is deliberately exposed and in the latter
942	   it is hidden.  Rather than using lexical boundary markers such as the
943	   period (`.') found in domain names, the ISBN uses a publisher prefix
944	   but doesn't disclose where the prefix ends and the publisher's
945	   assigned name begins.  This practice of non-disclosure, borrowed from
946	   the ISBN and ISSN schemes, is encouraged in assigning ARKs, because
947	   it reduces the visibility of an assertion that is probably not
948	   important now and may become a vulnerability later.

950	   Reasonable prefixes for assigned names usually consist of consonants
951	   and digits and are 1-5 characters in length.  For example, the
952	   constant prefix "x9t" might be delegated to a book digitization
953	   project that creates identifiers such as

955	             http://444.berkeley.edu/ark:/28722/x9t38rk45c

957	   If longevity is the goal, it is important to keep the prefixes free
958	   of recognizable semantics; for example, using an acronym representing
959	   a project or a department is discouraged.  At the same time, you may
960	   wish to set aside a subnamespace for testing purposes under a prefix
961	   such as "fk..." that can serve as a visual clue and reminder to
962	   maintenance staff that this "fake" identifier was never published.

964	   There are other measures one can take to avoid user confusion,
965	   transcription errors, and the appearance of accidental semantics when
966	   creating identifiers.  If you are generating identifiers
967	   automatically, pure numeric identifiers are likely to be
968	   semantically opaque enough, but it's probably useful to avoid leading
969	   zeroes because some users mistakenly treat them as optional, thinking
970	   (arithmetically) that they don't contribute to the "value" of the
971	   identifier.

973	   If you need lots of identifiers and you don't want them to get too
974	   long, you can mix digits with consonants (but avoid vowels since they
975	   might accidentally spell words) to get more identifiers without
976	   increasing the string length.  In this case you may not want more
977	   than two letters in a row, since shorter letter runs reduce the
978	   chance of acronyms.  Generator tools such as [NOID] provide support
979	   for these sorts of identifiers, and can also add a computed check
980	   character as a guarantee against the most common transcription
981	   errors.
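
   The naming advice in this section can be illustrated with a small
   generator.  This is a sketch only and is not the [NOID] algorithm:
   the function name mint_name and the exact consonant set are
   assumptions, no check character is computed, and uniqueness is not
   guaranteed, so a real minter would need to record every name it has
   assigned.

```python
import random
import string

DIGITS = "0123456789"
# Assumed alphabet: no vowels; 'l' and 'y' are also left out here.
CONSONANTS = "bcdfghjkmnpqrstvwxz"


def mint_name(prefix: str, length: int, rng: random.Random) -> str:
    """Return prefix plus `length` characters, never three letters in a row."""
    # Count letters at the end of the prefix so runs do not span the seam.
    letter_run = len(prefix) - len(prefix.rstrip(string.ascii_letters))
    chars = []
    for _ in range(length):
        pool = DIGITS if letter_run >= 2 else DIGITS + CONSONANTS
        ch = rng.choice(pool)
        letter_run = letter_run + 1 if ch in CONSONANTS else 0
        chars.append(ch)
    return prefix + "".join(chars)
```

   For example, mint_name("x9t", 7, random.Random()) returns a
   10-character name that begins with the delegated prefix "x9t".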

983	3.7.  Sub-Object Naming

985	   As mentioned previously, semantically opaque identifiers are very
986	   useful for long-term naming of abstract objects; however, it may be
987	   appropriate to extend these names with less opaque extensions that
988	   reference contemporary service entry points (sub-objects) in support
989	   of the object.  Sub-object extensions beginning with a digit or
990	   underscore (`_') are reserved for the possibility of developing a
991	   future registry of canonical service points (e.g., numeric references
992	   to versions, formats, languages, etc).

994	4.  Finding a Name Mapping Authority

996	   In order to derive an actionable identifier (these days, a URL) from
997	   an ARK, a hostport (hostname or hostname plus port combination) for a
998	   working Name Mapping Authority (NMA) must be found.  An NMA is a
999	   service that is able to respond to the three basic ARK service
1000	   requests.  Relying on registration and client-side discovery, NMAs
1001	   make known which NAAs' identifiers they are willing to service.

1003	   Upon encountering an ARK, a user (or client software) looks inside it
1004	   for the optional NMAH part (the hostport of the NMA's ARK service).
1005	   If it contains an NMAH that is working, this NMAH discovery step may
1006	   be skipped; the NMAH effectively uses the beginning of an ARK to
1007	   cache the results of a prior mapping authority discovery process.  If
1008	   a new NMAH needs to be found, the client looks inside the ARK again
1009	   for the NAAN (Name Assigning Authority Number).  Querying a global
1010	   database, it then uses the NAAN to look up all current NMAHs that
1011	   service ARKs issued by the identified NAA.  The global database is
1012	   key, and two specific methods for querying it are given in this
1013	   section.
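
   As a rough illustration of the inspection step, the following Python
   sketch splits an ARK into its optional NMAH, its NAAN, and the
   remaining name.  The pattern is a simplification assumed for this
   example (a numeric NAAN and an optional "http://" prefix), and the
   names split_ark and discovery_needed are hypothetical.

```python
import re

# Optional "http://NMAH/" prefix, then "ark:/", a numeric NAAN, and the name.
# Accepting "https" as well is an assumption beyond the text.
ARK_PATTERN = re.compile(
    r"^(?:https?://(?P<nmah>[^/]+)/)?ark:/(?P<naan>\d+)/(?P<name>.+)$"
)


def split_ark(ark):
    """Return (nmah, naan, name); nmah is None if the ARK has no hostport."""
    match = ARK_PATTERN.match(ark.strip())
    if match is None:
        raise ValueError("not an ARK: %r" % ark)
    return match.group("nmah"), match.group("naan"), match.group("name")


def discovery_needed(ark, nmah_is_working):
    """True when the NAAN must be used to look up a new NMAH.

    nmah_is_working is a caller-supplied test (hypothetical), for
    example an attempt to connect to the hostport.
    """
    nmah, _naan, _name = split_ark(ark)
    return nmah is None or not nmah_is_working(nmah)
```

   Only when discovery_needed returns true does the client go on to
   query the global database with the NAAN.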

1015	   A third very promising method, called the Name-to-Thing [N2T]
1016	   Resolver, is being explored.  It is a low-cost, highly stable,
1017	   consortially maintained NMAH that simply exists to support actionable
1018	   HTTP-based URLs for as long as HTTP is used.  One of its big
1019	   advantages over the other two methods and the URN, Handle, DOI, and
1020	   PURL methods is that N2T addresses the namespace splitting problem.
1021	   When objects maintained by one NMA are inherited by more than one
1022	   successor NMA, until now one of those successors would be required to
1023	   maintain forwarding tables on behalf of the other successors.

1025	   In the interests of long-term persistence, however, ARK mechanisms
1026	   are first defined in high-level, protocol-independent terms so that
1027	   mechanisms may evolve and be replaced over time without compromising
1028	   fundamental service objectives.  Either or both specific methods
1029	   given here may eventually be supplanted by better methods since, by
1030	   design, the ARK scheme does not depend on a particular method, but
1031	   only on having some method to locate an active NMAH.

1033	   At the time of issuance, at least one NMAH for an ARK should be
1034	   prepared to service it.  That NMA may or may not be administered by
1035	   the Name Assigning Authority (NAA) that created it.  Consider the
1036	   following hypothetical example of providing long-term access to a
1037	   cancer research journal.  The publisher wishes to turn a profit and
1038	   the National Library of Medicine wishes to preserve the scholarly
1039	   record.  An agreement might be struck whereby the publisher would act
1040	   as the NAA and the national library would archive the journal issue
1041	   when it appears, but without providing direct access for the first
1042	   six months.  During the first six months of peak commercial
1043	   viability, the publisher would retain exclusive delivery rights and
1044	   would charge access fees.  Again, by agreement, both the library and
1045	   the publisher would act as NMAs, but during that initial period the
1046	   library would redirect requests for issues less than six months old
1047	   to the publisher.  At the end of the waiting period, the library
1048	   would then begin servicing requests for issues older than six months
1049	   by tapping directly into its own archives.  Meanwhile, the publisher
1050	   might routinely redirect incoming requests for older issues to the
1051	   library.  Long-term access is thereby preserved, and so is the
1052	   commercial incentive to publish content.
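
   The coordination in this hypothetical agreement reduces to a small
   decision rule, sketched below in Python.  The function name, the
   role labels, and the approximation of "six months" as 183 days are
   assumptions made for the example; the agreement itself would fix
   the real terms.

```python
from datetime import date, timedelta

EMBARGO = timedelta(days=183)  # "six months", approximated (an assumption)


def action(nma, issued, today):
    """Return "serve" or "redirect" for a request reaching the given NMA.

    nma is "publisher" or "library"; issued is the issue's publication date.
    """
    recent = (today - issued) < EMBARGO
    if nma == "publisher":
        # The publisher delivers recent issues and may hand off older ones.
        return "serve" if recent else "redirect"
    if nma == "library":
        # The library forwards requests during the exclusive period.
        return "redirect" if recent else "serve"
    raise ValueError("unknown NMA: %r" % nma)
```

   Whichever NMA receives the request, exactly one of the two serves
   the issue, so the same ARK keeps working throughout.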

1054	   Although it will be common for an NAA also to run an NMA service, it
1055	   is never a requirement.  Over time NAAs and NMAs will come and go.
1056	   One NMA will succeed another, and there might be many NMAs serving
1057	   the same ARKs simultaneously (e.g., as mirrors or as competitors).
1058	   There might also be asymmetric but coordinated NMAs as in the
1059	   library-publisher example above.

1061	4.1.  Looking Up NMAHs in a Globally Accessible File

1063	   This subsection describes a way to look up NMAHs using a simple name
1064	   authority table represented as a plain text file.  For efficient
1065	   access the file may be stored in a local filesystem, but it needs to
1066	   be reloaded periodically to incorporate updates.  It is not expected
1067	   that the size of the file or frequency of update should impose an
1068	   undue maintenance or searching burden any time soon, for even
1069	   primitive linear search of a file with ten-thousand NAAs is a
1070	   subsecond operation on modern server machines.  The proposed file
1071	   strategy is similar to the /etc/hosts file strategy that supported
1072	   Internet host address lookup for a period of years before the advent
1073	   of DNS.

1075	   The name authority table file is updated on an ongoing basis and is
1076	   available for copying over the Internet from the California Digital
1077	   Library at http://www.cdlib.org/inside/diglib/ark/natab and from a
1078	   number of mirror sites.  The file contains comment lines (lines that
1079	   begin with `#') explaining the format and giving the file's
1080	   modification time, reloading address, and NAA registration
1081	   instructions.  A Perl script that processes the file is even
1082	   embedded in the file's comments.  As of February 2006, the
1083	   registered Name Assigning Authorities are:

1085	        12025            National Library of Medicine
1086	        12026            Library of Congress
1087	        12027            National Agricultural Library
1088	        13030            California Digital Library
1089	        13038            World Intellectual Property Organization
1090	        20775            University of California San Diego
1091	        29114            University of California San Francisco
1092	        28722            University of California Berkeley
1093	        21198            University of California Los Angeles
1094	        15230            Rutgers University
1095	        13960            Internet Archive
1096	        64269            Digital Curation Centre
1097	        62624            New York University
1098	        67531            University of North Texas
1099	        27927            Ithaka Electronic-Archiving Initiative
1100	        12148            Bibliotheque nationale de France / National Library of France
1101	        78319            Google
1102	        88435            Princeton University
1103	        78428            University of Washington
1104	        89901            Archives of Region of Vastra Gotaland and City of Gothenburg, Sweden
1105	        80444            Northwest Digital Archives
1106	        25593            Emory University
1107	        25031            University of Kansas
1108	        17101            Centre for Ecology & Hydrology, UK
1109	        65323            University of Calgary

1111	   A snapshot of the name authority table file appears in an appendix.

1113	4.2.  Looking Up NMAHs Distributed via DNS

1115	   This subsection introduces a method for looking up NMAHs that is
1116	   based on the method for discovering URN resolvers described in
1117	   [NAPTR].  It relies on querying the DNS system already installed in
1118	   the background infrastructure of most networked computers.  A query
1119	   is submitted to DNS asking for a list of resolvers that match a given
1120	   NAAN.  DNS distributes the query to the particular DNS servers that
1121	   can best provide the answer, unless the answer can be found more
1122	   quickly in a local DNS cache as a side-effect of a recent query.
1123	   Responses come back inside Name Authority Pointer (NAPTR) records.
1124	   The normal result is one or more candidate NMAHs.

1126	   In its full generality the [NAPTR] algorithm ambitiously accommodates
1127	   a complex set of preferences, orderings, protocols, mapping services,
1128	   regular expression rewriting rules, and DNS record types.  This
1129	   subsection proposes a drastic simplification of it for the special
1130	   case of ARK mapping authority discovery.  The simplified algorithm is
1131	   called Maptr.  It uses only one DNS record type (NAPTR) and restricts
1132	   most of its field values to constants.  The following hypothetical
1133	   excerpt from a DNS data file for the NAAN known as 12026 shows three
1134	   example NAPTR records ready to use with the Maptr algorithm.

1136	       12026.ark.arpa.
1137	       ;; US Library of Congress
1138	       ;;       order pref flags service regexp  replacement
1139	        IN NAPTR  0     0   "h"  "ark"   "USLC"  lhc.nlm.nih.gov:8080
1140	        IN NAPTR  0     0   "h"  "ark"   "USLC"  foobar.zaf.org
1141	        IN NAPTR  0     0   "h"  "ark"   "USLC"  sneezy.dopey.com

1143	   All the fields are held constant for Maptr except for the "flags",
1144	   "regexp", and "replacement" fields.  The "service" field contains the
1145	   constant value "ark" so that NAPTR records participating in the Maptr
1146	   algorithm will not be confused with other NAPTR records.  The "order"
1147	   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
1148	   the algorithm may evolve to use these fields for ranking decisions
1149	   when usage patterns and local administrative needs are better
1150	   understood.

1152	   When a Maptr query returns a record with a flags field of "h" (for
1153	   hostport, a Maptr extension to the NAPTR flags), the replacement
1154	   field contains the NMAH (hostport) of an ARK service provider.  When
1155	   a query returns a record with a flags field of "" (the empty string),
1156	   the client needs to submit a new query containing the domain name
1157	   found in the replacement field.  This second sort of record exploits
1158	   the distributed nature of DNS by redirecting the query to another
1159	   domain name.  It looks like this.

1161	       12345.ark.arpa.
1162	       ;; Digital Library Consortium
1163	       ;;       order pref flags service regexp replacement
1164	        IN NAPTR  0     0    ""  "ark"     ""   dlc.spct.org.

1166	   Here is the Maptr algorithm for ARK mapping authority discovery.  In
1167	   it replace <NAAN> with the NAAN from the ARK for which an NMAH is
1168	   sought.

1170	        (1) Initialize the DNS query:  type=NAPTR,
1171	        query=<NAAN>.ark.arpa.

1173	        (2) Submit the query to DNS and retrieve (NAPTR) records,
1174	        discarding any record that does not have "ark" for the service
1175	        field.

1177	        (3) All remaining records with a flags field of "h" contain
1178	        candidate NMAHs in their replacement fields.  Set them aside, if
1179	        any.

1181	        (4) Any record with an empty flags field ("") has a replacement
1182	        field containing a new domain name to which a subsequent query
1183	        should be redirected.  For each such record, set
1184	        query=<replacement> then go to step (2).  When all such records
1185	        have been recursively exhausted, go to step (5).

1187	        (5) All redirected queries have been resolved and a set of
1188	        candidate NMAHs has been accumulated from step (3).  If there
1189	        are zero NMAHs, exit - no mapping authority was found.  If there
1190	        are one or more NMAHs, choose one using any criteria you wish,
1191	        then exit.

1193	   A Perl script that implements this algorithm is included here.

1195	     #!/depot/bin/perl

1197	     use Net::DNS;                 # include simple DNS package
1198	     my $qtype = "NAPTR";               # initialize query type
1199	     my $naa = shift;              # get NAAN script argument
1200	     my $mad = Net::DNS::Resolver->new; # mapping authority discovery

1202	     maptr("$naa.ark.arpa");       # call maptr - that's it

1204	     sub maptr {                   # recursive maptr algorithm
1205	          my $dname = shift;       # domain name as argument
1206	          my ($flags, $replacement);    # only fields Maptr acts on;
1207	                                        # service is checked below
1208	          my $query = $mad->query($dname, $qtype);
1209	          return                   # non-productive query
1210	               if (! $query || ! $query->answer);
1211	          foreach my $rr ($query->answer) {
1212	               next           # skip wrong type or service
1213	                    if ($rr->type ne $qtype || $rr->service ne "ark");
1214	               ($flags, $replacement) =  # accessors return unquoted
1215	                    ($rr->flags, $rr->replacement);   # field values
1216	               if ($flags eq "") {
1217	                    maptr($replacement);     # recurse
1218	               } elsif ($flags eq "h") {
1219	                    print "$replacement\n";  # candidate NMAH
1220	               }
1221	          }
1222	     }

1224	   The global database thus distributed via DNS and the Maptr algorithm
1225	   can easily be seen to mirror the contents of the Name Authority Table
1226	   file described in the previous section.
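
   The five Maptr steps given earlier in this subsection can also be
   traced without a DNS server.  The following Python sketch is not
   part of the ARK specification.  It replaces the DNS query with a
   lookup in an in-memory table (ZONE and lookup_naptr are
   hypothetical), reuses the example records shown above, invents
   records for dlc.spct.org. so that the redirect has a target, and
   adds a guard against redirect loops that the steps do not mention.

```python
# Hypothetical NAPTR data keyed by query name; each record is
# (order, pref, flags, service, regexp, replacement).
ZONE = {
    "12026.ark.arpa.": [
        (0, 0, "h", "ark", "USLC", "lhc.nlm.nih.gov:8080"),
        (0, 0, "h", "ark", "USLC", "foobar.zaf.org"),
        (0, 0, "h", "ark", "USLC", "sneezy.dopey.com"),
    ],
    "12345.ark.arpa.": [
        (0, 0, "", "ark", "", "dlc.spct.org."),
    ],
    # Invented target of the redirect above, so the example terminates.
    "dlc.spct.org.": [
        (0, 0, "h", "ark", "", "ark.dlc.example"),
        (0, 0, "u", "http", "", "ignored.example"),  # wrong service
    ],
}


def lookup_naptr(qname):
    """Stand-in for a DNS NAPTR query (an assumption; no network is used)."""
    return ZONE.get(qname, [])


def maptr(naan, lookup=lookup_naptr):
    """Return candidate NMAHs for a NAAN, following steps (1) to (5)."""
    pending = ["%s.ark.arpa." % naan]      # step (1)
    seen = set()                           # loop guard, not part of the steps
    candidates = []
    while pending:
        qname = pending.pop(0)
        if qname in seen:
            continue
        seen.add(qname)
        for _order, _pref, flags, service, _regexp, replacement in lookup(qname):
            if service != "ark":           # step (2): discard other services
                continue
            if flags == "h":               # step (3): candidate NMAH
                candidates.append(replacement)
            elif flags == "":              # step (4): follow the redirect
                pending.append(replacement)
    return candidates                      # step (5): caller picks one, if any
```

   An empty result corresponds to the "no mapping authority was found"
   exit; otherwise the caller chooses among the returned hostports.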

1228	5.  Generic ARK Service Definition

1230	   An ARK request's output is delivered information; examples include
1231	   the object itself, a policy declaration (e.g., a promise of support),
1232	   a descriptive metadata record, or an error message.  The experience
1233	   of object delivery is expected to be an evolving mix of information
1234	   that reflects changing service expectations and technology
1235	   requirements; contemporary examples include such things as an object
1236	   summary and component links formatted for human consumption.  ARK
1237	   services must be couched in high-level, protocol-independent terms if
1238	   persistence is to outlive today's networking infrastructural
1239	   assumptions.  The high-level ARK service definitions listed below are
1240	   followed in the next section by a concrete method (one of many
1241	   possible methods) for delivering these services with today's
1242	   technology.

1244	5.1.  Generic ARK Access Service (access, location)

1246	   Returns (a copy of) the object or a redirect to the same, although a
1247	   sensible object proxy may be substituted.  Examples of sensible
1248	   substitutes include,

1250	     - a table of contents instead of a large complex document,
1251	     - a home page instead of an entire web site hierarchy,
1252	     - a rights clearance challenge before accessing protected data,
1253	     - directions for access to an offline object (e.g., a book),
1254	     - a description of an intangible object (a disease, an event), or
1255	     - an applet acting as "player" for a large multimedia object.

1257	   May also return a discriminated list of alternate object locators.
1258	   If access is denied, returns an explanation of the object's current
1259	   (perhaps permanent) inaccessibility.

1261	5.2.  Generic Policy Service (permanence, naming, etc.)

1263	   Returns declarations of policy and support commitments for given
1264	   ARKs.  Declarations are returned in either a structured metadata
1265	   format or a human readable text format; sometimes one format may
1266	   serve both purposes.  Policy subareas may be addressed in separate
1267	   requests, but the following areas should be covered:  object
1268	   permanence, object naming, object fragment addressing, and
1269	   operational service support.

1271	   The permanence declaration for an object is a rating defined with
1272	   respect to an identified permanence provider (guarantor), which will
1273	   be the NMA.  It may include the following aspects.

1275	        (a) "object availability" - whether and how access to the object
1276	        is supported (e.g., online 24x7, or offline only),

1278	        (b) "identifier validity" - under what conditions the identifier
1279	        will be or has been re-assigned,

1281	        (c) "content invariance" - under what conditions the content of
1282	        the object is subject to change, and

1284	        (d) "change history" - access to corrections, migrations, and
1285	        revisions, whether through links to the changed objects
1286	        themselves or through a document summarizing the change history.

1288	   One approach to a permanence rating framework, conceived
1289	   independently from ARKs, is given in [NLMPerm].  Under ongoing
1290	   development and limited deployment at the US National Library of
1291	   Medicine, it identifies the following "permanence levels":

1293	        Not Guaranteed: No commitment has been made to retain this
1294	        resource.  It could become unavailable at any time.  Its
1295	        identifier could be changed.

1297	        Permanent: Dynamic Content: A commitment has been made to keep
1298	        this resource permanently available.  Its identifier will always
1299	        provide access to the resource.  Its content could be revised or
1300	        replaced.

1302	        Permanent: Stable Content: A commitment has been made to keep
1303	        this resource permanently available.  Its identifier will always
1304	        provide access to the resource.  Its content is subject only to
1305	        minor corrections or additions.

1307	        Permanent: Unchanging Content: A commitment has been made to
1308	        keep this resource permanently available.  Its identifier will
1309	        always provide access to the resource.  Its content will not
1310	        change.

1312	   Naming policy for an object includes an historical description of the
1313	   NAA's (and its successor NAA's) policies regarding differentiation of
1314	   objects.  Since it is the NMA that responds to requests for policy
1315	   statements, it is useful for the NMA to be able to produce or
1316	   summarize these historical NAA documents.  Naming policy may include
1317	   the following aspects.

1319	        (i) "similarity" - (or "unity") the limit, defined by the NAA,
1320	        to the level of dissimilarity beyond which two similar objects
1321	        warrant separate identifiers but before which they share one
1322	        single identifier, and

1324	        (ii) "granularity" - the limit, defined by the NAA, to the level
1325	        of object subdivision beyond which sub-objects do not warrant
1326	        separately assigned identifiers but before which sub-objects are
1327	        assigned separate identifiers.

1329	   Subnaming policy for an object describes the qualifiers that the NMA,
1330	   in fulfilling its ongoing and evolving service obligations, allows as
1331	   extensions to an NAA-assigned ARK.  To the conceptual object that the
1332	   NAA named with an ARK, the NMA may add component access points and
1333	   derivatives (e.g., format migrations in aid of preservation) in order
1334	   to provide both basic and value-added services.

1336	   Addressing policy for an object includes a description of how, during
1337	   access, object components (e.g., paragraphs, sections) or views
1338	   (e.g., image conversions) may or may not be "addressed", in other
1339	   words, how the NMA permits arguments or parameters to modify the
1340	   object delivered as the result of an ARK request.  If supported,
1341	   these sorts of operations would provide things like byte-ranged
1342	   fragment delivery and open-ended format conversions, or any set of
1343	   possible transformations that would be too numerous to list or to
1344	   identify with separately assigned ARKs.

1346	   Operational service support policy includes a description of general
1347	   operational aspects of the NMA service, such as after-hours staffing
1348	   and trouble reporting procedures.

1350	5.3.  Generic Description Service

1352	   Returns a description of the object.  Descriptions are returned in
1353	   either a structured metadata format or a human readable text format;
1354	   sometimes one format may serve both purposes.  A description must at
1355	   a minimum answer the who, what, when, and where questions concerning
1356	   an expression of the object.  Standalone descriptions should be
1357	   accompanied by the modification date and source of the description
1358	   itself.  May also return discriminated lists of ARKs that are related
1359	   to the given ARK.

1361	6.  Overview of The HTTP URL Mapping Protocol (THUMP)

1363	   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (a
1364	   kind of identifier) and asking such questions as, what information
1365	   does this identify and how permanent is it?  [THUMP] is in fact one
1366	   specific method under development for delivering ARK services.  The
1367	   protocol runs over HTTP to exploit the web browser's current pre-
1368	   eminence as user interface to the Internet.  THUMP is designed so
1369	   that a person can enter ARK requests directly into the location field
1370	   of current browser interfaces.  Because it runs over HTTP, THUMP can
1371	   be simulated and tested within keyboard-based [TELNET] sessions.

1373	   The asker (a person or client program) starts with an identifier,
1374	   such as an ARK or a URL.  The identifier reveals to the asker (or
1375	   allows the asker to infer) the Internet host name and port number of
1376	   a server system that responds to questions.  Here, this is just the
1377	   NMAH that is obtained by inspection and possibly lookup based on the
1378	   ARK's NAAN.  The asker then sets up an HTTP session with the server
1379	   system, sends a question via a THUMP request (contained within an
1380	   HTTP request), receives an answer via a THUMP response (contained
1381	   within an HTTP response), and closes the session.  That concludes the
1382	   connected portion of the protocol.

1384	   A THUMP request is a string of characters beginning with a `?'
1385	   (question mark) that is appended to the identifier string.  The
1386	   resulting string is sent as an argument to HTTP's GET command.
1387	   Request strings too long for GET may be sent using HTTP's POST
1388	   command.  The three most common requests correspond to three
1389	   degenerate special cases that keep the user's learning and typing
1390	   burden low.  First, a simple key with no request at all is the same
1391	   as an ordinary access request.  Thus a plain ARK entered into a
1392	   browser's location field behaves much like a plain URL, and returns
1393	   access to the primary identified object, for instance, an HTML
1394	   document.

1396	   The second special case is a minimal ARK description request string
1397	   consisting of just "?".  For example, entering the string,

1399	             ark.nlm.nih.gov/12025/psbbantu?

1401	   into the browser's location field directly precipitates a request for
1402	   a metadata record describing the object identified by
1403	   ark:/12025/psbbantu.  The browser, unaware of THUMP, prepares and
1404	   sends an HTTP GET request in the same manner as for a URL.  THUMP is
1405	   designed so that the response (indicated by the returned HTTP content
1406	   type) is normally displayed, whether the output is structured for
1407	   machine processing (text/plain) or formatted for human consumption
1408	   (text/html).

1410	   In the following example THUMP session, each line has been annotated
1411	   to include a line number and whether it was the client or server that
1412	   sent it.  Without going into much depth, the session has four pieces
1413	   separated from each other by blank lines:  the client's piece (lines
1414	   1-3), the server's HTTP/THUMP response headers (4-7), the record set
1415	   header (8-11), and the record (12-17).  The first and last lines (1
1416	   and 17) correspond to the client's steps to start the TCP session and
1417	   the server's steps to end it, respectively.

1419	      1  C: [opens session]
1420	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
1421	         C:
1422	         S: HTTP/1.1 200 OK
1423	      5  S: Content-Type: text/plain
1424	         S: THUMP-Status: 0.1 200 OK
1425	         S:
1426	         S: |set: NLM | 12025/psbbantu? | 20030731
1427	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
1428	     10  S: here: 1 | 1 | 1
1429	         S:
1430	         S: erc:
1431	         S: who:    Lederberg, Joshua
1432	         S: what:   Studies of Human Families for Genetic Linkage
1433	     15  S: when:   1974
1434	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1435	         S: [closes session]

1437	   The first two server response lines (4-5) above are typical of HTTP.
1438	   The next line (6) is peculiar to THUMP, and indicates the THUMP
1439	   version and a normal return status.  The balance of the response
1440	   consists of a record set header (lines 8-10) and a single metadata
1441	   record (12-16) that comprises the ARK description service response.
1442	   The record set header identifies (8-9) who created the set, what its
1443	   title is, when it was created, and where an automated process can
1444	   access the set; it ends in a line (10) whose respective sub-elements
1445	   indicate that here in this communication the recipient can expect to
1446	   find 1 record, starting at the record numbered 1, from a set
1447	   consisting of a total of 1 record (i.e., here is the entire set,
1448	   consisting of exactly one record).
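
   A client might read the last line of the record set header along the
   following lines.  This Python sketch assumes only the layout visible
   in the example (the label "here", a colon, and three integers
   separated by `|'); the function name parse_here is hypothetical.

```python
def parse_here(line):
    """Split a "here:" header line into three integers.

    Returns (records_here, first_record_number, records_in_set).
    """
    label, _, value = line.partition(":")
    if label.strip() != "here":
        raise ValueError("not a here: line: %r" % line)
    here, start, total = (int(part) for part in value.split("|"))
    return here, start, total
```

   When the first and third numbers are equal and the second is 1, as
   in the example, the response carries the entire set.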

1450	   The returned record (12-16) is in the format of an Electronic
1451	   Resource Citation [ERC], which is discussed in more detail in the
1452	   next section.  For now, note that it contains four elements that
1453	   answer the top priority questions regarding an expression of the
1454	   object:  who played a major role in expressing it, what the
1455	   expression was called, when it was created, and where the expression
1456	   may be found.  This quartet of elements comes up again and again in
1457	   ERCs.

1459	   The third degenerate special case of an ARK request (and no other
1460	   cases will be described in this document) is the string "??",
1461	   corresponding to a minimal permanence policy request.  It can be seen
1462	   in use appended to an ARK (on line 2) in the example session that
1463	   follows.

1465	      1  C: [opens session]
1466	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
1467	         C:
1468	         S: HTTP/1.1 200 OK
1469	      5  S: Content-Type: text/plain
1470	         S: THUMP-Status: 0.1 200 OK
1471	         S:
1472	         S: |set: NLM | 12025/psbbantu?? | 20030731
1473	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu??
1474	     10  S: here: 1 | 1 | 1
1475	         S:
1476	         S: erc:
1477	         S: who:    Lederberg, Joshua
1478	         S: what:   Studies of Human Families for Genetic Linkage
1479	     15  S: when:   1974
1480	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1481	         S: erc-support:
1482	         S: who:    USNLM
1483	         S: what:   Permanent, Unchanging Content
1484	     20  S: when:   20010421
1485	         S: where:  http://ark.nlm.nih.gov/yy22948
1486	         S: [closes session]

1488	   Again, a single metadata record (lines 12-21) is returned, but it
1489	   consists of two segments.  The first segment (12-16) gives the same
1490	   basic citation information as in the previous example.  It is
1491	   returned in order to establish context for the persistence
1492	   declaration in the second segment (17-21).
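
   The three minimal request forms described in this section (no
   request string, "?", and "??") can be told apart mechanically.  The
   Python sketch below is illustrative only; the function names and the
   labels "access", "description", and "policy" are assumptions, and
   longer request strings are deliberately left unclassified.

```python
def request_kind(url):
    """Classify a URL by its THUMP request string.

    Only the three minimal cases are recognized.
    """
    _base, mark, rest = url.partition("?")
    if mark == "":
        return "access"        # plain ARK: ordinary access request
    if rest == "":
        return "description"   # request string is just "?"
    if rest == "?":
        return "policy"        # request string is "??"
    return "other"             # longer requests are outside this sketch


def thump_urls(nmah, ark_path):
    """Build the three minimal request URLs for a path like "12025/psbbantu"."""
    base = "http://%s/ark:/%s" % (nmah, ark_path)
    return {"access": base, "description": base + "?", "policy": base + "??"}
```

   For the example object, thump_urls("ark.nlm.nih.gov",
   "12025/psbbantu") yields the URLs used in the two sessions above,
   plus the plain access URL.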

1494	   Each segment in an ERC tells a different story relating to the
1495	   object, so although the same four questions (elements) appear in
1496	   each, the answers depend on the segment's story type.  While the
1497	   first segment tells the story of an expression of the object, the
1498	   second segment tells the story of the support commitment made to it:
1499	   who made the commitment, what the nature of the commitment was, when
1500	   it was made, and where a fuller explanation of the commitment may be
1501	   found.

1503	7.  Overview of Electronic Resource Citations (ERCs)

1505	   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
1506	   kind of object description that uses Dublin Core Kernel [Kernel]
1507	   metadata elements.  The ERC with Kernel elements provides a simple,
1508	   compact, and printable record for holding data associated with an
1509	   information resource.  By design, Kernel metadata balances the needs
1510	   for expressive power, very simple machine processing, and direct
1511	   human manipulation.

1513	   A founding principle of Kernel metadata is that direct human contact
1514	   with metadata will be a necessary and sufficient condition for the
1515	   near term rapid development of metadata standards, systems, and
1516	   services.  Thus the machine-processable Kernel elements must only
1517	   minimally strain people's ability to read, understand, change, and
1518	   transmit ERCs without their relying on intermediation with
1519	   specialized software tools.  The basic ERC needs to be succinct,
1520	   transparent, and trivially parseable by software.

1522	   In the current Internet, it is natural seriously to consider using
1523	   XML as an exchange format because of predictions that it will obviate
1524	   many ad hoc formats and programs, and unify much of the world's
1525	   information under one reliable data structuring discipline that is
1526	   easy to generate, verify, parse, and render.  It appears, however,
1527	   that XML is still only catching on after years of standards work and
1528	   implementation experience.  The reasons for it are unclear, but for
1529	   now very simple XML interpretation is still out of reach.  Another
1530	   important caution is that XML structures are hard on the eyeballs,
1531	   taking up an amount of display (and page) space that significantly
1532	   exceeds that of traditional formats.  Until these conflicts with ERC
1533	   principle are resolved, XML is not a first choice for representing
1534	   ERCs.  Borrowing instead from the data structuring format that
1535	   underlies the successful spread of email and web services, the first
1536	   ERC format uses [ANVL], which is based on email and HTTP headers
1537	   [RFC822].  There is a naturalness to ANVL's label-colon-value format
1538	   (seen in the previous section) that barely needs explanation to a
1539	   person beginning to enter ERC metadata.

1541	   Besides simplicity of ERC system implementation and data entry
1542	   mechanics, ERC semantics (what the record and its constituent parts
1543	   mean) must also be easy to explain.  ERC semantics are based on a
1544	   reformulation and extension of the Dublin Core [DCORE] hypothesis,
1545	   which suggests that the fifteen Dublin Core metadata elements have a
1546	   key role to play in cross-domain resource description.  The ERC
1547	   design recognizes that the Dublin Core's primary contribution is the
1548	   international, interdisciplinary consensus that identified fifteen
1549	   semantic buckets (element categories), regardless of how they are
1550	   labeled.  The ERC then adds a definition for a record and some
1551	   minimal compliance rules.  In pursuing the limits of simplicity, the
1552	   ERC design combines and relabels some Dublin Core buckets to isolate
1553	   a tiny kernel (subset) of four elements for basic cross-domain
1554	   resource description.

1556	   For the cross-domain kernel, the ERC uses the four basic elements -
1557	   who, what, when, and where - to pretend that every object in the
1558	   universe can have a uniform minimal description.  Each has a name or
1559	   other identifier, a location, some responsible person or party, and a
1560	   date.  It doesn't matter what type of object it is, or whether one
1561	   plans to read it, interact with it, smoke it, wear it, or navigate
1562	   it.  Of course, this approach is flawed because uniformity of
1563	   description for some object types requires more semantic contortion
1564	   and sacrifice than for others.  That is why at the beginning of this
1565	   document, the ARK was said to be suited to objects that accommodate
1566	   reasonably regular electronic description.

1568	   While insisting on uniformity at the most basic level provides
1569	   powerful cross-domain leverage, the semantic sacrifice is great for
1570	   many applications.  So the ERC also permits a semantically rich and
1571	   nuanced description to co-exist in a record along with a basic
1572	   description.  In that way both sophisticated and naive recipients of
1573	   the record can extract the level of meaning from it that best suits
1574	   their needs and abilities.  Key to unlocking the richer description
1575	   is a controlled vocabulary of ERC record types (not explained in this
1576	   document) that permit knowledgeable recipients to apply defined sets
1577	   of additional assumptions to the record.

1579	7.1.  ERC Syntax

1581	   An ERC record is a sequence of metadata elements ending in a blank
1582	   line.  An element consists of a label, a colon, and an optional
1583	   value.  Here is an example of a record with five elements.

1585	          erc:
1586	          who: Gibbon, Edward
1587	          what: The Decline and Fall of the Roman Empire
1588	          when: 1781
1589	          where: http://www.ccel.org/g/gibbon/decline/

1591	   A long value may be folded (continued) onto the next line by
1592	   inserting a newline and indenting the next line.  A value can be thus
1593	   folded across multiple lines.  Here are two example elements, each
1594	   folded across four lines.

1596	          who/created: University of California, San Francisco, AIDS
1597	               Program at San Francisco General Hospital | University
1598	               of California, San Francisco, Center for AIDS Prevention
1599	               Studies
1600	          what/Topic:
1601	                Heart Attack | Heart Failure
1602	               | Heart
1603	                                Diseases

1605	   An element value folded across several lines is treated as if the
1606	   lines were joined together on one long line.  For example, the second
1607	   element from the previous example is considered equivalent to

1609	          what/Topic: Heart Attack | Heart Failure | Heart Diseases

1611	   An element value may contain multiple values, each one separated from
1612	   the next by a `|' (pipe) character.  The element from the previous
1613	   example contains three values.

1615	   For annotation purposes, any line beginning with a `#' (hash)
1616	   character is treated as if it were not present; this is a "comment"
1617	   line (a feature not available in email or HTTP headers).  For
1618	   example, the following element is spread across four lines and
1619	   contains two values:

1621	          what/Topic:
1622	               Heart Attack
1623	          #    | Heart Failure  -- hold off until next review cycle
1624	               | Heart Diseases
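
	   As a minimal sketch of the syntax rules above (not part of the ERC
	   specification), the following Python function reads one record,
	   drops comment lines, joins folded lines, and splits values on `|'.
	   The name parse_erc is hypothetical, and joining folded lines with a
	   single space is an assumption suggested by the examples.

```python
def parse_erc(text):
    """Parse one ERC record into a list of (label, values) pairs."""
    raw = []                                  # [label, joined value] pairs
    for line in text.splitlines():
        if line.startswith("#"):
            continue                          # comment: treated as absent
        if not line.strip():
            if raw:
                break                         # a blank line ends the record
            continue                          # skip blank lines before it
        if line[0] in " \t" and raw:
            raw[-1][1] += " " + line.strip()  # folded continuation line
        else:
            label, _, value = line.partition(":")
            raw.append([label.strip(), value.strip()])
    elements = []
    for label, value in raw:
        # Assumption: an empty value yields an empty list of values.
        values = [v.strip() for v in value.split("|")] if value.strip() else []
        elements.append((label, values))
    return elements


record = """\
what/Topic:
     Heart Attack
#    | Heart Failure  -- hold off until next review cycle
     | Heart Diseases

"""
print(parse_erc(record))
```

	   The sketch expects the record to start at the left margin, so the
	   indentation used for display in this document would have to be
	   removed first.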

1626	7.2.  ERC Stories

1628	   An ERC record is organized into one or more distinct segments, where
1629	   each segment tells a story about a different aspect of the
1630	   information resource.  A segment boundary occurs whenever a segment
1631	   label (an element beginning with "erc") is encountered.  The basic
1632	   label "erc:" introduces the story of an object's expression (e.g.,
1633	   its publication, installation, or performance).  The label
1634	   "erc-about:" introduces the story of an object's content (what it is
1635	   about) and "erc-support:" introduces the story of a support
1636	   commitment made to it.  A story segment that concerns the ERC itself
1637	   is introduced by the label "erc-from:".  It is an important segment
1638	   that tells the story of the ERC's provenance.  Elements beginning
1639	   with "erc" are reserved for segment labels and their associated story
1640	   types.  From an earlier example, here is an ERC with two segments.

1642	         erc:
1643	         who:    Lederberg, Joshua
1644	         what:   Studies of Human Families for Genetic Linkage
1645	         when:   1974
1646	         where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1647	         erc-support:
1648	         who:    NIH/NLM/LHNCBC
1649	         what:   Permanent, Unchanging Content
1650	         # Note to ops staff:  date needs verification.
1651	         when:   2001 04 21
1652	         where:  http://ark.nlm.nih.gov/yy22948

1654	   Segment stories are told according to journalistic tradition.  While
1655	   any number of pertinent elements may appear in a segment, priority is
1656	   placed on answering the questions who, what, when, and where at the
1657	   beginning of each segment so that readers can make the most important
1658	   selection or rejection decisions as soon as possible.  To keep things
1659	   simple, the questions appear in that same order in every segment
1660	   (as it happens, most people who have been exposed to this
1661	   storytelling technique are already familiar with that
1662	   ordering).

1664	   The four questions are answered by using corresponding element
1665	   labels.  The four element labels can be re-used in each story
1666	   segment, but their meaning changes depending on the segment (the
1667	   story type) in which they appear.  In the example above, "who" is
1668	   first used to name a document's author and subsequently used to name
1669	   the permanence guarantor (provider).  Similarly, "when" first lists
1670	   the date of object creation and in the next segment lists the date of
1671	   a commitment decision.  Four labels appearing across three segments
1672	   effectively map to twelve semantically distinct elements.  Distinct
1673	   element meanings are mapped to Dublin Core elements in a later
1674	   section.
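
	   The segment rule above can be sketched as follows.  This is an
	   illustration only: the function name split_segments and the
	   dictionary layout are hypothetical, and the input is assumed to be
	   a list of (label, value) pairs that has already been parsed.

```python
def split_segments(elements):
    """Group (label, value) pairs into segments at each 'erc...' label."""
    segments = []
    for label, value in elements:
        if label.startswith("erc"):
            # A segment label opens a new story segment.
            segments.append({"label": label, "value": value, "elements": []})
        else:
            if not segments:
                # Assumption: elements seen before any segment label are
                # collected under an unlabeled segment.
                segments.append({"label": None, "value": None, "elements": []})
            segments[-1]["elements"].append((label, value))
    return segments


record = [
    ("erc", ""),
    ("who", "Lederberg, Joshua"),
    ("what", "Studies of Human Families for Genetic Linkage"),
    ("when", "1974"),
    ("where", "http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf"),
    ("erc-support", ""),
    ("who", "NIH/NLM/LHNCBC"),
    ("what", "Permanent, Unchanging Content"),
    ("when", "2001 04 21"),
    ("where", "http://ark.nlm.nih.gov/yy22948"),
]
for segment in split_segments(record):
    print(segment["label"], "->", dict(segment["elements"])["who"])
```

	   Because each segment keeps its own list, the same label ("who")
	   can be read in the context of the segment that contains it.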

1676	7.3.  The ERC Anchoring Story

1678	   Each ERC contains an anchoring story.  It is usually the first
1679	   segment labeled "erc:" and it concerns an "anchoring" expression of
1680	   the object.  An "anchoring" expression is the one that a provider
1681	   deemed the most suitable basic referent given the audience and
1682	   application for which it produced the ERC.  If it sounds like the
1683	   provider has great latitude in choosing its anchoring expression, it
1684	   is because it does.  A typical anchoring story in an ERC for a born-
1685	   digital document would be the story of the document's release on a
1686	   web site; such a document would then be the anchoring expression.

1688	   An anchoring story need not be the central descriptive goal of an ERC
1689	   record.  For example, a museum provider may create an ERC for a
1690	   digitized photograph of a painting but choose to anchor it in the
1691	   story of the original painting instead of the story of the electronic
1692	   likeness; although the ERC may through other segments prove to be
1693	   centrally concerned with describing the electronic likeness, the
1694	   provider may have chosen this particular anchoring story in order to
1695	   make the ERC visible in a way that is most natural to patrons (who
1696	   would find the Mona Lisa under da Vinci sooner than they would find
1697	   it under the name of the person who snapped the photograph or scanned
1698	   the image).  In another example, a provider that creates an ERC for a
1699	   dramatic play as an abstract work has the task of describing a piece
1700	   of intangible intellectual property.  To anchor this abstract object
1701	   in the concrete world, if only through a derivative expression, it
1702	   makes sense for the provider to choose a suitable printed edition of
1703	   the play as the anchoring object expression (to describe in the
1704	   anchoring story) of the ERC.

1706	   The anchoring story has special rules designed to keep ERC processing
1707	   simple and predictable.  Each of the four basic elements (who, what,
1708	   when, and where) must be present, unless a best effort to supply it
1709	   fails.  In the event of failure, the element still appears but a
1710	   special value (described later) is used to explain the missing value.
1711	   While the requirement that each of the four elements be present only
1712	   applies to the anchoring story segment, as usual these elements
1713	   appear at the beginning of the segment and may only be used in the
1714	   prescribed order.  A minimal ERC would normally consist of just an
1715	   anchoring story and the element quartet, as illustrated in the next
1716	   example.

1718	         erc:
1719	         who:   National Research Council
1720	         what:  The Digital Dilemma
1721	         when:  2000
1722	         where: http://books.nap.edu/html/digital%5Fdilemma

1724	   A minimal ERC can be abbreviated so that it resembles a traditional
1725	   compact bibliographic citation that is nonetheless completely machine
1726	   processable.  The required elements and ordering make it possible to
1727	   eliminate the element labels, as shown here.

1729	         erc: National Research Council | The Digital Dilemma | 2000
1730	                | http://books.nap.edu/html/digital%5Fdilemma
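
	   A recipient could restore the labels of such an abbreviated record
	   along the following lines.  This is a sketch under the assumption
	   that the value has already been unfolded onto one line and holds
	   exactly four `|'-separated parts; expand_abbreviated is a
	   hypothetical name.

```python
KERNEL = ("who", "what", "when", "where")   # the prescribed element order


def expand_abbreviated(value):
    """Turn 'who | what | when | where' into labeled (label, value) pairs."""
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != len(KERNEL):
        raise ValueError("expected four values: who | what | when | where")
    return list(zip(KERNEL, parts))


abbreviated = ("National Research Council | The Digital Dilemma | 2000 "
               "| http://books.nap.edu/html/digital%5Fdilemma")
for label, value in expand_abbreviated(abbreviated):
    print(label + ":", value)
```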

1732	7.4.  ERC Elements

1734	   As mentioned, the four basic ERC elements (who, what, when, and
1735	   where) take on different specific meanings depending on the story
1736	   segment in which they are used.  By appearing in each segment, albeit
1737	   in different guises, the four elements serve as a valuable mnemonic
1738	   device - a kind of checklist - for constructing minimal story
1739	   segments from scratch.  Again, it is only in the anchoring segment
1740	   that all four elements are mandatory.

1742	   Here are some mappings between ERC elements and Dublin Core [DCORE]
1743	   elements.

1745	          Segment     ERC Element     Equivalent Dublin Core Element
1746	         ---------    -----------     ------------------------------
1747	            erc          who          Creator/Contributor/Publisher
1748	            erc          what                Title
1749	            erc          when                Date
1750	            erc          where               Identifier
1751	         erc-about       who                  <none>
1752	         erc-about       what                Subject
1753	         erc-about       when                Coverage (temporal)
1754	         erc-about       where               Coverage (spatial)
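
	   For software, the table above amounts to a lookup keyed on both the
	   segment and the element label.  The following sketch copies the
	   table as given; the names DUBLIN_CORE and dublin_core_equivalent
	   are hypothetical.

```python
# (segment, ERC element) -> equivalent Dublin Core element, or None.
DUBLIN_CORE = {
    ("erc", "who"): "Creator/Contributor/Publisher",
    ("erc", "what"): "Title",
    ("erc", "when"): "Date",
    ("erc", "where"): "Identifier",
    ("erc-about", "who"): None,
    ("erc-about", "what"): "Subject",
    ("erc-about", "when"): "Coverage (temporal)",
    ("erc-about", "where"): "Coverage (spatial)",
}


def dublin_core_equivalent(segment, element):
    """Return the Dublin Core equivalent, or None if the table has none."""
    return DUBLIN_CORE.get((segment, element))


print(dublin_core_equivalent("erc", "what"))
print(dublin_core_equivalent("erc-about", "what"))
```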

1756	   The basic element labels may also be qualified to add nuances to the
1757	   semantic categories that they identify.  Elements are qualified by
1758	   appending a `/' (slash) and a qualifier term.  Often qualifier terms
1759	   appear as the past tense form of a verb because it makes re-using
1760	   qualifiers among elements easier.

1762	         who/published:  ...
1763	         when/published: ...
1764	         where/published: ...

1766	   Using past tense verbs for qualifiers also reminds providers and
1767	   recipients that element values contain transient assertions that may
1768	   have been true once, but that tend to become less true over time.
1769	   Recipients that don't understand the meaning of a qualifier can fall
1770	   back onto the semantic category (bucket) designated by the
1771	   unqualified element label.  Inevitably recipients (people and
1772	   software) will have diverse abilities in understanding elements and
1773	   qualifiers.

1775	   Any number of other elements and qualifiers may be used in
1776	   conjunction with the quartet of basic segment questions.  The only
1777	   semantic requirement is that they pertain to the segment's story.
1778	   Also, it is only the four basic elements that change meaning
1779	   depending on their segment context.  All other elements have meaning
1780	   independent of the segment in which they appear.  If an element label
1781	   stripped of its qualifier is still not recognized by the recipient, a
1782	   second fall back position is to ignore it and rely on the four basic
1783	   elements.
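
	   The two fallback positions described above might be coded as
	   follows.  This is only a sketch: interpret_label is a hypothetical
	   name, and the sets of terms that a recipient understands are
	   assumed to be supplied by the caller.

```python
def interpret_label(label, known_elements, known_qualifiers):
    """Return (element, qualifier), (element, None), or None to ignore."""
    element, _, qualifier = label.partition("/")
    if element not in known_elements:
        return None                      # second fallback: ignore the element
    if qualifier and qualifier not in known_qualifiers:
        return (element, None)           # first fallback: drop the qualifier
    return (element, qualifier or None)


elements = {"who", "what", "when", "where"}
qualifiers = {"published"}
for label in ("when/published", "when/Reviewed", "IDcode"):
    print(label, "->", interpret_label(label, elements, qualifiers))
```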

1785	   Elements may be either Canonical, Provisional, or Local.  Canonical
1786	   elements are officially recognized via a registry as part of the
1787	   metadata vernacular.  All elements, qualifiers, and segment labels
1788	   used in this document up until now belong to that vernacular.
1789	   Provisional elements are also officially recognized via the registry,
1790	   but have only been proposed for inclusion in the vernacular.  To be
1791	   promoted to the vernacular, a provisional element passes through a
1792	   vetting process during which its documentation must be in order and
1793	   its community acceptance demonstrated.  Local elements are any
1794	   elements not officially recognized in the registry.  The registry
1795	   [Kernel] is a work in progress.

1797	   Local elements are immediately distinguishable from Canonical or
1798	   Provisional elements because all terms that begin with an upper case
1799	   letter are reserved for spontaneous local use.  No term beginning
1800	   with an upper case letter will ever be assigned Canonical or
1801	   Provisional status, so it should be safe to use such terms for local
1802	   purposes.  Any recipient of external ERCs containing such terms will
1803	   understand them to be part of the originating provider's local
1804	   metadata dialect.  Here's an example ERC with three segments, one
1805	   local element, and two local qualifiers.  The segment boundaries have
1806	   been emphasized by comment lines (which, as before, are ignored by
1807	   processors).

1809	         erc:
1810	         who: Bullock, TH | Achimowicz, JZ | Duckrow, RB
1811	                 | Spencer, SS | Iragui-Madoz, VJ
1812	         what: Bicoherence of intracranial EEG in sleep,
1813	                 wakefulness and seizures
1814	         when: 1997 12 00
1815	         where: http://cogprints.soton.ac.uk/%{
1816	                 documents/disk0/00/00/01/22/index.html %}
1817	         in: EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
1818	         IDcode: cog00000122
1819	         # ---- new segment ----
1820	         erc-about:
1821	         what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
1822	                 | Cooperativity | Subdural | Hippocampus | Higher moment
1823	         # ---- new segment ----
1824	         erc-from:
1825	         who: NIH/NLM/NCBI
1826	         what: pm9546494
1827	         when/Reviewed: 1998 04 18 021600
1828	         where: http://ark.nlm.nih.gov/12025/pm9546494?

1830	   The local element "IDcode" immediately precedes the "erc-about"
1831	   segment, which itself contains an element with the local qualifier
1832	   "Subcategory".  The second to last element also carries the local
1833	   qualifier "Reviewed".  Finally, what might be a provisional element
1834	   "in" appears near the end of the first segment.  It might have been
1835	   proposed as a way to complete a citation for an object originally
1836	   appearing inside another object (such as an article appearing in a
1837	   journal or an encyclopedia).
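
	   The upper case rule makes local terms easy to detect mechanically.
	   A minimal sketch, with the hypothetical helper names is_local and
	   local_terms, applied to labels from the example above:

```python
def is_local(term):
    """A term is local if it begins with an upper case letter."""
    return term[:1].isupper()


def local_terms(label):
    """Return the local terms (element and/or qualifier) in a label."""
    element, _, qualifier = label.partition("/")
    return [term for term in (element, qualifier) if term and is_local(term)]


for label in ("IDcode", "what/Subcategory", "when/Reviewed", "in"):
    print(label, "->", local_terms(label))
```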

1839	7.5.  ERC Element Values

1841	   ERC element values tend to be straightforward strings.  If the
1842	   provider intends something special for an element, it will so
1843	   indicate with markers at the beginning of its value string.  The
1844	   markers are designed to be uncommon enough that they would not likely
1845	   occur in normal data except by deliberate intent.  Markers can only
1846	   occur near the beginning of a string, and once any octet of non-
1847	   marker data has been encountered, no further marker processing is
1848	   done for the element value.  In the absence of markers the string is
1849	   considered pure data; this has been the case with all the examples
1850	   seen thus far.  The fullest form of an element value with all three
1851	   optional markers in place looks like this.

1853	         VALUE =    [markup_flags]    (:ccode)    ,    DATA

1855	   In processing, the first non-whitespace character of an ERC element
1856	   value is examined.  An initial `[' is reserved to introduce a
1857	   bracketed set of markup flags (not described in this document) that
1858	   ends with `]'.  If ERC data is machine-generated, each value string
1859	   may be preceded by "[]" to prevent any of its data from being
1860	   mistaken for markup flags.  Once past the optional markup, the
1861	   remaining value may optionally begin with a controlled code.  A
1862	   controlled code always has the form "(:ccode)", for example,

1864	         who: (:unkn) Anonymous
1865	         what: (:791) Bee Stings

1867	   Any string after such a code is taken to be an uncontrolled (e.g.,
1868	   natural language) equivalent.  The code "unkn" indicates a
1869	   conventional explanation for a missing value (stating that the value
1870	   is unknown).  The remainder of the string makes an equivalent
1871	   statement in a form that the provider deemed most suitable to its
1872	   (probably human) audience.  The code "791" could be a fixed numeric
1873	   topic identifier within an unspecified topic vocabulary.  Any code
1874	   may be ignored by those that do not understand it.

1876	   There are several codes to explain different ways in which a required
1877	   element's value may go missing.

1879	         (:unac)   temporarily inaccessible
1880	         (:unal)   unallowed, suppressed intentionally
1881	         (:unap)   not applicable, makes no sense
1882	         (:unas)   value unassigned (e.g., Untitled)
1883	         (:unav)   value unavailable indefinitely
1884	         (:unkn)   unknown (e.g., Anonymous, Inconnue)
1885	         (:etal)   too numerous to list (et alia)
1886	         (:none)   never had a value, never will
1887	         (:null)   explicitly empty
1888	         (:tba)    to be assigned or announced later

1890	   Once past an optional controlled code, the remaining string value is
1891	   subjected to one final test.  If the next non-whitespace
1892	   character is a `,' (comma), it indicates that the string value is
1893	   "sort-friendly".  This means that the value is (a) laid out with an
1894	   inverted word order useful for sorting items having comparably laid
1895	   out element values (items might be the containing ERC records) and
1896	   (b) that the value may contain other commas that indicate inversion
1897	   points should it become necessary to recover the value in natural
1898	   word order.  Typically, this feature is used to express Western-style
1899	   personal names in family-name-given-name order.  It can also be used
1900	   wherever natural word order might make sorting tricky, such as when
1901	   data contains titles or corporate names.  Here are some example
1902	   elements.

1904	         who:   ,  van Gogh, Vincent
1905	         who:,Howell, III, PhD, 1922-1987, Thurston
1906	         who:, Acme Rocket Factory, Inc., The
1907	         who:, Mao Tse Tung
1908	         who:, McCartney, Paul, Sir,
1909	         what:, Health and Human Services, United States Government
1910	                 Department of, The,

1912	   There are rules to use in recovering a copy of the value in natural
1913	   word order, if desired.  The above example strings have the following
1914	   natural word order values, respectively.

1916	         Vincent van Gogh
1917	         Thurston Howell, III, PhD, 1922-1987
1918	         The Acme Rocket Factory, Inc.
1919	         Mao Tse Tung
1920	         Sir Paul McCartney
1921	         The United States Government Department of Health and Human Services
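
	   The three optional markers can be peeled off the front of a value
	   in the order given by the VALUE layout above.  The following is a
	   sketch, not a reference parser: parse_value is a hypothetical name
	   and the regular expressions are assumptions about marker spelling.
	   It reports only whether a value is sort-friendly; it does not
	   attempt to recover natural word order, since the rules for that are
	   not given in this document.

```python
import re

_FLAGS = re.compile(r"\s*\[([^\]]*)\]")   # [markup_flags]
_CODE = re.compile(r"\s*\(:([^)\s]+)\)")  # (:ccode)
_SORT = re.compile(r"\s*,")               # leading comma: sort-friendly


def parse_value(value):
    """Split a value into (markup_flags, ccode, sort_friendly, data)."""
    flags = code = None
    sort_friendly = False
    match = _FLAGS.match(value)
    if match:
        flags, value = match.group(1), value[match.end():]
    match = _CODE.match(value)
    if match:
        code, value = match.group(1), value[match.end():]
    match = _SORT.match(value)
    if match:
        sort_friendly, value = True, value[match.end():]
    # Everything that remains is data; no further markers are looked for.
    return flags, code, sort_friendly, value.strip()


for sample in ("(:unkn) Anonymous", "   ,  van Gogh, Vincent",
               "[] (:791) Bee Stings", "Gibbon, Edward"):
    print(parse_value(sample))
```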

1923	7.6.  ERC Element Encoding and Dates

1925	   Some characters that need to appear in ERC element values might
1926	   conflict with special characters used for structuring ERCs, so there
1927	   needs to be a way to include them as literal characters that are
1928	   protected from special interpretation.  This is accomplished through
1929	   an encoding mechanism that resembles the %-encoding familiar to [URI]
1930	   handlers.

1932	   The ERC encoding mechanism also uses `%', but instead of taking two
1933	   following hexadecimal digits, it takes one non-alphanumeric character
1934	   or two alphabetic characters that cannot be mistaken for hex digits.
1935	   It is designed not to be confused with normal web-style %-encoding.
1936	   In particular it can be decoded without risking unintended decoding
1937	   of normal %-encoded data (which would introduce errors).  Here are
1938	   the one-character (non-alphanumeric) ERC encoding extensions.

1940	         ERC       Purpose
1941	         ---     ------------------------------------------------
1942	         %!      decodes to the element separator `|'
1943	         %%      decodes to a percent sign `%'
1944	         %.      decodes to a comma `,'
1945	         %_      a non-character used as syntax shim
1946	         %{      a non-character that begins an expansion block
1947	         %}      a non-character that ends an expansion block

1949	   One particularly useful construct in ERC element values is the pair
1950	   of special encoding markers ("%{" and "%}") that indicates an
1951	   "expansion" block.  Whatever string of characters they enclose will
1952	   be treated as if none of the contained whitespace (SPACEs, TABs,
1953	   Newlines) were present.  This comes in handy for writing long, multi-
1954	   part URLs in a readable way.  For example, the value in

1956	         where: http://foo.bar.org/node%{
1957	                    ? db = foo
1958	                    & start = 1
1959	                    & end = 5
1960	                    & buf = 2
1961	                    & query = foo + bar + zaf
1962	                %}

1964	   is decoded into an equivalent element, but with a correct and intact
1965	   URL:

1967	     where:
1968	      http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
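
	   A decoder for the one-character extensions and expansion blocks
	   might look like the following sketch.  The name erc_decode is
	   hypothetical, decoding "%_" to nothing is an assumption based on
	   its description as a non-character, and corner cases such as "%%"
	   immediately before "{" are not handled.  Normal %-encoded data
	   (e.g., "%5F") is left untouched.

```python
import re

# One-character ERC encoding extensions and what they decode to.
_SINGLE = {"!": "|", "%": "%", ".": ",", "_": ""}


def erc_decode(value):
    """Decode ERC expansion blocks and one-character %-extensions."""
    # 1. Drop the %{ and %} markers and all whitespace between them.
    value = re.sub(r"%\{(.*?)%\}",
                   lambda m: re.sub(r"\s+", "", m.group(1)),
                   value, flags=re.DOTALL)
    # 2. Decode the remaining extensions in one left-to-right pass, so
    #    that the "%" produced by "%%" is never decoded a second time.
    return re.sub(r"%([!%._])", lambda m: _SINGLE[m.group(1)], value)


where = """http://foo.bar.org/node%{
           ? db = foo
           & start = 1
           & end = 5
           & buf = 2
           & query = foo + bar + zaf
       %}"""
print(erc_decode(where))
```

	   Since "%!" decodes to `|', a processor would presumably split a
	   value on `|' before decoding each part.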

1970	   In a parting word about ERC element values, a commonly recurring
1971	   value type is a date, possibly followed by a time.  ERC dates use the
1972	   [TEMPER] format, taking on one of the following forms:

1974	         1999                (four digit year)
1975	         2000 12 29          (year, month, day)
1976	         2000 12 29 235955   (year, month, day, hour, minute, second)

1978	   In dates, all internal whitespace is squeezed out to achieve a
1979	   normalized form suitable for lexical comparison and sorting.  This
1980	   means that the following dates

1982	         2000 12 29 235955           (recommended for readability)
1983	         2000 12 29 23 59 55
1984	         20001229 23 59 55
1985	         20001229235955              (normalized date and time)

1987	   are all equivalent.  The first form is recommended for readability.
1988	   The last form (shortest and easiest to compute with) is the
1989	   normalized form.  Hyphens and commas are reserved to create date
1990	   ranges and lists, for example,

1992	         1996-2000                   (a range of five years)
1993	         1952, 1957, 1969            (a list of three years)
1994	         1952, 1958-1967, 1985       (a mixed list of dates and ranges)
1995	         20001229-20001231           (a range of three days)
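
	   The normalization and the range and list conventions above can be
	   sketched in a few lines.  The names normalize_date and parse_dates
	   are hypothetical, and the sketch does not validate that a date is
	   well formed; [TEMPER] remains the authority on the format.

```python
def normalize_date(text):
    """Squeeze out all whitespace, e.g. '2000 12 29' -> '20001229'."""
    return "".join(text.split())


def parse_dates(text):
    """Return (start, end) pairs; a single date has start == end."""
    items = []
    for item in text.split(","):          # commas separate list items
        start, _, end = item.partition("-")   # a hyphen marks a range
        start = normalize_date(start)
        end = normalize_date(end) or start
        items.append((start, end))
    return items


forms = ["2000 12 29 235955", "2000 12 29 23 59 55",
         "20001229 23 59 55", "20001229235955"]
print({normalize_date(form) for form in forms})
print(parse_dates("1952, 1958-1967, 1985"))
```

	   Normalized dates of equal precision can then be compared as plain
	   strings.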

1997	7.7.  ERC Stub Records and Internal Support

1999	   The ERC design introduces the concept of a "stub" record, which is an
2000	   incomplete ERC record intended to be supplemented with additional
2001	   elements before being released as a standalone ERC record.  A stub
2002	   ERC record has no minimum required elements.  It is just a group of
2003	   elements that does not begin with "erc:" but otherwise conforms to
2004	   the ERC record syntax.

2006	   ERC stubs may be useful in supporting internal procedures using the
2007	   ERC syntax.  Often they rely on the convenience and accuracy of
2008	   automatically supplied elements, even the basic ones.  To be ready
2009	   for external use, however, an ERC stub must be transformed into a
2010	   complete ERC record having the usual required elements.  An ERC stub
2011	   record can be convenient for metadata embedded in a document, where
2012	   elements such as location, modification date, and size - which one
2013	   would not omit from an externalized record - are omitted simply
2014	   because they are much better supplied by a computation.  A separate
2015	   local administrative procedure, not defined for ERCs in general,
2016	   would effect the promotion of stubs into complete records.

2018	   While the ERC is a general-purpose container for exchange of resource
2019	   descriptions, it does not dictate how records must be internally
2020	   stored, laid out, or assembled by data providers or recipients.
2021	   Arbitrary internal descriptive frameworks can support ERCs simply by
2022	   mapping (e.g., on demand) local records to the ERC container format
2023	   and making them available for export.  Therefore, to support ERCs
2024	   there is no need for a data provider to convert internal data to be
2025	   stored in an ERC format.  On the other hand, any provider (such as
2026	   one just getting started in the business of resource description) may
2027	   choose to store and manipulate local data natively in the ERC format.

2029	8.  Advice to Web Clients

2031	   This section offers advice to web client software developers.  It is
2032	   necessarily speculative, because it tries to anticipate a series of
2033	   events that might lead to native web browser support for ARKs.

2035	   ARKs are envisaged to appear wherever durable object references are
2036	   planned.  Library cataloging records, literature citations, and
2037	   bibliographies are important examples.  In many of these places URLs
2038	   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
2039	   PURLs have been proposed as alternatives.

2041	   The strings representing ARKs are also envisaged to appear in some of
2042	   the places where URLs currently appear:  in hypertext links (where
2043	   they are not normally shown to users) and in rendered text (displayed
2044	   or printed).  Internet search engines, for example, tend to include
2045	   both actionable and manifest links when listing each item found.  A
2046	   normal HTML link for which the URL is not displayed looks like this.

2048	          <a href = "http://foo.bar.org/index.htm"> Click Here </a>

2050	   The same link with an ARK instead of a URL:

2052	          <a href = "ark:/14697/b12345x"> Click Here </a>

2054	   Web browsers would in general require a small modification to
2055	   recognize and convert this ARK, via mapping authority discovery, to
2056	   the URL form.

2058	          <a href = "http://a.b.org/ark:/14697/b12345x"> Click Here </a>

2060	   A browser that knows how to make that conversion could also
2061	   automatically detect and replace a non-working NMAH.
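
	   The conversion itself is a simple string operation once an NMAH is
	   known.  The sketch below assumes the NMAH has already been obtained
	   (mapping authority discovery is not modeled), and the names
	   ark_to_url and replace_nmah, as well as the host "c.d.org", are
	   hypothetical.

```python
def ark_to_url(ark, nmah):
    """Prefix an ARK such as 'ark:/14697/b12345x' with an NMAH."""
    if not ark.startswith("ark:/"):
        raise ValueError("not an ARK: %r" % ark)
    return "http://%s/%s" % (nmah, ark)


def replace_nmah(url, new_nmah):
    """Swap the NMAH of an ARK in URL form for another one."""
    index = url.find("/ark:/")
    if index < 0:
        raise ValueError("no ARK found in: %r" % url)
    return ark_to_url(url[index + 1:], new_nmah)


url = ark_to_url("ark:/14697/b12345x", "a.b.org")
print(url)
print(replace_nmah(url, "c.d.org"))
```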

2063	   An NAA will typically make known the associations it creates by
2064	   publishing them in catalogs, actively advertising them, or simply
2065	   leaving them on web sites for visitors (e.g., users, indexing
2066	   spiders) to stumble across in browsing.

2068	9.  Security Considerations

2070	   The ARK naming scheme poses no direct risk to computers and networks.
2071	   Implementors of ARK services need to be aware of security issues when
2072	   querying networks and filesystems for Name Mapping Authority
2073	   services, and the concomitant risks from spoofing and obtaining
2074	   incorrect information.  These risks are no greater for ARK mapping
2075	   authority discovery than for other kinds of service discovery.  For
2076	   example, recipients of ARKs with a specified hostport (NMAH) should
2077	   treat it like a URL and be aware that the identified ARK service may
2078	   no longer be operational.

2080	   Apart from mapping authority discovery, ARK clients and servers
2081	   subject themselves to all the risks that accompany normal operation
2082	   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
2083	   As a specialization of such protocols, an ARK service may limit
2084	   exposure to the usual risks.  Indeed, ARK services may enhance a kind
2085	   of security by helping users identify long-term reliable references
2086	   to information objects.

2088	10.  Authors' Addresses

2090	   John A. Kunze
2091	   California Digital Library
2092	   University of California, Office of the President
2093	   415 20th St, 4th Floor
2094	   Oakland, CA  94612-3550, USA

2096	   Fax:   +1 510-893-5212
2097	   EMail: jak@ucop.edu

2099	   R. P. C. Rodgers
2100	   US National Library of Medicine
2101	   8600 Rockville Pike, Bldg. 38A
2102	   Bethesda, MD  20894, USA

2104	   Fax:   +1 301-496-0673
2105	   EMail: rodgers@nlm.nih.gov

2107	11.  References

2109	   [ANVL]     J. Kunze, B. Kahle, et al, "A Name-Value Language", work
2110	              in progress,
2111	              http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf

2113	   [ARK]      J. Kunze, "Towards Electronic Persistence Using ARK
2114	              Identifiers", Proceedings of the 3rd ECDL Workshop on Web
2115	              Archives, August 2003, (PDF)
2116	              http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze

2118	   [DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
2119	              Element Set, Version 1.1:  Reference Description", July
2120	              1999, http://dublincore.org/documents/dces/.

2122	   [DNS]      P.V. Mockapetris, "Domain Names - Concepts and
2123	              Facilities", RFC 1034, November 1987.

2125	   [DOI]      International DOI Foundation, "The Digital Object
2126	              Identifier (DOI) System", February 2001,
2127	              http://dx.doi.org/10.1000/203.

2129	   [ERC]      J. Kunze, "A Metadata Kernel for Electronic Permanence",
2130	              Journal of Digital Information, Vol 2, Issue 2, January
2131	              2002, ISSN 1368-7506, (PDF)
2132	              http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

2134	   [Handle]   L. Lannom, "Handle System Overview", ICSTI Forum, No. 30,
2135	              April 1999, http://www.icsti.org/forum/30/#lannom

2137	   [HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
2138	              HTTP/1.1", RFC 2616, June 1999.

2140	   [Kernel]   Dublin Core Metadata Initiative, "Kernel Metadata Working
2141	              Group", http://dublincore.org/groups/kernel/

2143	   [MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
2144	              April 1992.

2146	   [N2T]      CDL, "Name-to-Thing Resolver", work in progress, August
2147	              2006, http://n2t.info

2149	   [NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
2150	              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

2152	   [NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
2153	              Electronic Information", ARL 212:8-9, October 2000,
2154	              http://www.arl.org/newsltr/212/nlm.html

2156	   [NOID]     J. Kunze, "Nice Opaque Identifiers", February 2005,
2157	              http://www.cdlib.org/inside/diglib/ark/noid.pdf

2159	   [PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
2160	              Resource Locators", 1996,
2161	              http://purl.oclc.org/OCLC/PURL/INET96

2163	   [RFC822]   D. Crocker, "Standard for the format of ARPA Internet text
2164	              messages", RFC 822, August 1982.

2166	   [TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
2167	              RFC 854, May 1983.

2169	   [TEMPER]   J. Kunze, "Temporal Enumerated Ranges", work in progress,
2170	              http://www.cdlib.org/inside/diglib/ark/temperspec.pdf

2172	   [THUMP]    K. Gamiel, J. Kunze, "The HTTP URL Mapping Protocol", work
2173	              in progress,
2174	              http://www.ietf.org/internet-drafts/draft-kunze-thump-00.txt

2176	   [URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
2177	              (URI): Generic Syntax", RFC 2396, August 1998.

2179	   [URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
2180	              as Uniform Resource Names", RFC 2288, February 1998.

2182	   [URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

2184	   [URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
2185	              RFC 2611, June 1999.

2187	12.  Appendix:  ARK Implementations

2189	   Currently, the primary implementation activity is at the California
2190	   Digital Library (CDL),

2192	         http://ark.cdlib.org/

2194	   housed at the University of California Office of the President, where
2195	   over 200,000 ARKs have been assigned to objects that the CDL owns or
2196	   controls.  Some experimentation with ARKs is taking place at JSTOR,
2197	   the Digital Curation Centre, WIPO, and the University of California's
2198	   San Diego, San Francisco, and Berkeley campuses.

2200	   The US National Library of Medicine (NLM) also has an experimental,
2201	   prototype ARK service under development.  It is being made available
2202	   for purposes of demonstrating various aspects of the ARK system, but
2203	   is subject to temporary or permanent withdrawal (without notice)
2204	   depending upon the circumstances of the small research group
2205	   responsible for making it available.  It is described at:

2207	         http://ark.nlm.nih.gov/

2209	   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

2211	13.  Appendix:  Current ARK Name Authority Table

2213	   This appendix contains a copy of the Name Authority Table (a file) at
2214	   the time of writing.  It may be loaded into a local filesystem (e.g.,
2215	   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
2216	   NMAHs (Name Mapping Authority Hostports).  It contains Perl code that
2217	   can be copied into a standalone script that processes the table (as a
2218	   file).  Because this is still a proposed file, none of the values in
2219	   it are real.

2221	     #
2222	     # Name Assigning Authority / Name Mapping Authority Lookup Table
2223	     #    Last change:   2007.06.05
2224	     #       Reload from:   http://ark.nlm.nih.gov/etc/natab
2225	     #       Mirrored at:   http://www.cdlib.org/inside/diglib/ark/natab
2226	     #       To register:   mailto:ark@cdlib.org?Subject=naareg
2227	     #       Process with:  Perl script at end of this file (optional)
2228	     #
2229	     # Each NAA appears at the beginning of a line with the NAA Number
2230	     # first, a colon, and an ARK or URL to a statement of naming policy
2231	     # (see http://ark.cdlib.org for an example).
2232	     # All the NMA hostports that service an NAA are listed, one per
2233	     # line, indented, after the corresponding NAA line.
2234	     #
2235	     #       National Library of Medicine
2236	     12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
2237	             ark.nlm.nih.gov USNLM
2238	             foobar.zaf.org UCSF
2239	     #
2240	     #       Library of Congress
2241	     12026:  http://www.loc.gov/xxx/naapolicy.html
2242	             foobar.zaf.org USLC
2243	     #
2244	     #       National Agricultural Library
2245	     12027:  http://www.nal.gov/xxx/naapolicy.html
2246	             foobar.zaf.gov:80 USNAL
2247	     #
2248	     #       California Digital Library
2249	     13030:  http://www.cdlib.org/inside/diglib/ark/
2250	             ark.cdlib.org CDL
2251	     #
2252	     #       World Intellectual Property Organization
2253	     13038:  http://www.wipo.int/xxx/naapolicy.html
2254	             www.wipo.int WIPO
2255	     #
2256	     #       University of California San Diego
2257	     20775:  http://library.ucsd.edu/xxx/naapolicy.html
2258	             ucsd.edu UCSD
2259	     #
2260	     #       University of California San Francisco
2261	     29114:  http://library.ucsf.edu/xxx/naapolicy.html
2262	             ucsf.edu UCSF
2263	     #
2264	     #       University of California Berkeley
2265	     28722:  http://library.berkeley.edu/xxx/naapolicy.html
2266	             berkeley.edu UCB
2267	     #
2268	     #       University of California Los Angeles
2269	     21198:  http://library.ucla.edu/xxx/naapolicy.html
2270	             ucla.edu UCLA
2271	     #
2272	     #       Rutgers University
2273	     15230:  http://rci.rutgers.edu/xxx/naapolicy.html
2274	             rutgers.edu RU
2275	     #
2276	     #       Internet Archive
2277	     13960:  http://www.archive.org/xxx/naapolicy.html
2278	             archive.org IA
2279	     #
2280	     #       Digital Curation Centre
2281	     64269:  http://www.dcc.ac.uk/xxx/naapolicy.html
2282	             dcc.ac.uk DCC
2283	     #
2284	     #       New York University
2285	     62624:  http://library.nyu.edu/xxx/naapolicy.html
2286	             nyu.edu NYU
2287	     #
2288	     #       University of North Texas
2289	     67531:  http://www.library.unt.edu/xxx/naapolicy.html
2290	             unt.edu UNT
2291	     #
2292	     #       Ithaka Electronic-Archiving Initiative
2293	     27927:  http://www.ithaka.org/xxx/naapolicy.html
2294	             ithaka.org ITHAKA
2295	     #
2296	     #       Bibliothèque nationale de France / National Library of France
2297	     12148:  http://www.bnf.fr/xxx/naapolicy.html
2298	             bnf.fr BNF
2299	     #
2300	     #       Princeton University
2301	     88435:  http://diglib.princeton.edu/xxx/naapolicy.html
2302	             princeton.edu PU
2303	     #
2304	     #       University of Washington
2305	     78428:  http://u.washington.edu/xxx/naapolicy.html
2306	             u.washington.edu UW
2307	     #
2308	     #       Archives of Region of Västra Götaland and City of Gothenburg, Sweden
2309	     89901:  http://www.arkivnamnden.org/xxx/naapolicy.html
2310	             arkivnamnden.org AVGG
2311	     #
2312	     #       Northwest Digital Archives
2313	     80444:  http://nwda.wsulibs.wsu.edu/xxx/naapolicy.html
2314	             nwda.wsulibs.wsu.edu NWDA
2315	     #
2316	     #       Emory University
2317	     25593:  http://id.library.emory.edu/xxx/naapolicy.html
2318	             id.library.emory.edu EMORY
2319	     #
2320	     #       University of Kansas
2321	     25031:  http://www.lib.ku.edu/xxx/naapolicy.html
2322	             www.lib.ku.edu UKANSAS

2324	     #
2325	     #       Google
2326	     78319:  http://www.google.com/xxx/naapolicy.html
2327	             www.google.com GOOGLE
2328	     #
2329	     #       Centre for Ecology & Hydrology, UK
2330	     17101:  http://www.ceh.ac.uk/xxx/naapolicy.html
2331	             www.ceh.ac.uk CEH
2332	     #
2333	     #       University of Calgary
2334	     65323:  http://library.ucalgary.ca/xxx/naapolicy.html
2335	             ucalgary.ca UCALGARY
2336	     #
2337	     #12345: reserved for examples
2338	     #
2339	     #--- end of data ---
2340	     # The following Perl script takes an NAA as argument and outputs
2341	     # the NMAs in this file listed under any matching NAA.
2342	     #
2343	     # my $naa = shift;
2344	     # while (<>) {
2345	     #       next if (! /^$naa:/);
2346	     #       while (<>) {
2347	     #               last if (! /^[#\s]./);
2348	     #               print "$1\n" if (/^\s+(\S+)/);
2349	     #       }
2350	     # }
2351	     #
2352	     # Create a g/t/nroff-safe version of this table with the UNIX command,
2353	     #
2354	     #       expand natab | sed 's/\\/\\\e/g' > natab.roff
2355	     #
2356	     # end of file
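
   As an illustration only, the NAA-to-NMAH lookup described in this
   appendix might be sketched in Python as follows.  The sketch assumes
   the table has been saved as a standalone file in which each NAA line
   begins in column one, that is, without the indentation used to
   display it in this document.  The names parse_natab, lookup_nmahs,
   and SAMPLE are hypothetical and are not part of the ARK
   specification.  Unlike the Perl script above, which stops at the
   first bare comment line after a matching NAA, this sketch skips
   comment lines and collects NMA lines until the next NAA line; on a
   table laid out as above the two approaches should give the same
   hostports.

```python
import re

# An NAA line starts in column one: NAA number, colon, policy URL or ARK.
NAA_LINE = re.compile(r"^(\d+):\s+(\S+)")
# An NMA line is indented: hostport first, then an optional label.
NMA_LINE = re.compile(r"^\s+(\S+)")

# A small excerpt in the same layout as the table (placeholder values).
SAMPLE = """\
#
#       National Library of Medicine
12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov USNLM
        foobar.zaf.org UCSF
#
#       National Agricultural Library
12027:  http://www.nal.gov/xxx/naapolicy.html
        foobar.zaf.gov:80 USNAL
#
#12345: reserved for examples
"""


def parse_natab(text):
    """Map each NAA number to its policy URL and list of NMA hostports."""
    table = {}
    current = None  # entry for the NAA whose NMA lines are being read
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        naa = NAA_LINE.match(line)
        if naa:
            current = {"policy": naa.group(2), "nmahs": []}
            table[naa.group(1)] = current
            continue
        nma = NMA_LINE.match(line)
        if nma and current is not None:
            current["nmahs"].append(nma.group(1))
    return table


def lookup_nmahs(text, naa):
    """Return the NMA hostports listed under the given NAA, if any."""
    entry = parse_natab(text).get(naa)
    return entry["nmahs"] if entry else []


print(lookup_nmahs(SAMPLE, "12025"))
```

   Parsing the whole table into a dictionary, rather than scanning for
   a single NAA, is a design choice made here so that repeated lookups
   do not re-read the file; a one-pass scan in the style of the Perl
   script would serve equally well for a single query.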

2358	14.  Copyright Notice

2360	   Copyright (C) The IETF Trust (2007).  This document is subject to the
2361	   rights, licenses and restrictions contained in BCP 78, and except as
2362	   set forth therein, the authors retain all their rights.

2364	   This document and the information contained herein are provided on an
2365	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2366	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
2367	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
2368	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2369	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2370	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2372	Expires 24 January 2008
2373	                           Table of Contents

2375	Status of this Document  . . . . . . . . . . . . . . . . . . . . . .   1
2376	Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
2377	1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .   3
2378	1.1.  Reasons to Use ARKs  . . . . . . . . . . . . . . . . . . . . .   4
2379	1.2.  Three Requirements of ARKs . . . . . . . . . . . . . . . . . .   4
2380	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff  . . .   5
2381	1.4.  Definition of Identifier . . . . . . . . . . . . . . . . . . .   7
2382	2.  ARK Anatomy  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
2383	2.1.  The Name Mapping Authority Hostport (NMAH) . . . . . . . . . .   8
2384	2.2.  The ARK Label Part - ark:  . . . . . . . . . . . . . . . . . .   9
2385	2.3.  The Name Assigning Authority Number (NAAN) . . . . . . . . . .  10
2386	2.4.  The Name Part  . . . . . . . . . . . . . . . . . . . . . . . .  10
2387	2.5.  The Qualifier Part . . . . . . . . . . . . . . . . . . . . . .  11
2388	2.5.1.  ARKs that Reveal Object Hierarchy  . . . . . . . . . . . . .  12
2389	2.5.2.  ARKs that Reveal Object Variants . . . . . . . . . . . . . .  13
2390	2.6.  Character Repertoires  . . . . . . . . . . . . . . . . . . . .  14
2391	2.7.  Normalization and Lexical Equivalence  . . . . . . . . . . . .  15
2392	3.  Naming Considerations  . . . . . . . . . . . . . . . . . . . . .  16
2393	3.1.  ARKS Embedded in Language  . . . . . . . . . . . . . . . . . .  16
2394	3.2.  Objects Should Wear Their Identifiers  . . . . . . . . . . . .  17
2395	3.3.  Names are Political, not Technological . . . . . . . . . . . .  17
2396	3.4.  Choosing a Hostname or NMA . . . . . . . . . . . . . . . . . .  17
2397	3.5.  Assigners of ARKs  . . . . . . . . . . . . . . . . . . . . . .  19
2398	3.6.  NAAN Namespace Management  . . . . . . . . . . . . . . . . . .  20
2399	3.7.  Sub-Object Naming  . . . . . . . . . . . . . . . . . . . . . .  21
2400	4.  Finding a Name Mapping Authority . . . . . . . . . . . . . . . .  21
2401	4.1.  Looking Up NMAHs in a Globally Accessible File . . . . . . . .  22
2402	4.2.  Looking up NMAHs Distributed via DNS . . . . . . . . . . . . .  23
2403	5.  Generic ARK Service Definition . . . . . . . . . . . . . . . . .  26
2404	5.1.  Generic ARK Access Service (access, location)  . . . . . . . .  26
2405	5.2.  Generic Policy Service (permanence, naming, etc.)  . . . . . .  26
2406	5.3.  Generic Description Service  . . . . . . . . . . . . . . . . .  28
2407	6.  Overview of The HTTP URL Mapping Protocol (THUMP)  . . . . . . .  28
2408	7.  Overview of Electronic Resource Citations (ERCs) . . . . . . . .  31
2409	7.1.  ERC Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  33
2410	7.2.  ERC Stories  . . . . . . . . . . . . . . . . . . . . . . . . .  34
2411	7.3.  The ERC Anchoring Story  . . . . . . . . . . . . . . . . . . .  35
2412	7.4.  ERC Elements . . . . . . . . . . . . . . . . . . . . . . . . .  36
2413	7.5.  ERC Element Values . . . . . . . . . . . . . . . . . . . . . .  38
2414	7.6.  ERC Element Encoding and Dates . . . . . . . . . . . . . . . .  40
2415	7.7.  ERC Stub Records and Internal Support  . . . . . . . . . . . .  41
2416	8.  Advice to Web Clients  . . . . . . . . . . . . . . . . . . . . .  42
2417	9.  Security Considerations  . . . . . . . . . . . . . . . . . . . .  43
2418	10.  Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . .  43
2419	11.  References  . . . . . . . . . . . . . . . . . . . . . . . . . .  44
2420	12.  Appendix:  ARK Implementations  . . . . . . . . . . . . . . . .  45
2421	13.  Appendix:  Current ARK Name Authority Table . . . . . . . . . .  46
2422	14.  Copyright Notice  . . . . . . . . . . . . . . . . . . . . . . .  49