idnits 2.17.1 

draft-kunze-ark-06.txt:
-(62): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(270): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(333): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(380): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(400): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(422): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(424): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(430): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(432): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(770): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(804): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(808): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(985): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1040): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1169): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1193): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1364): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1366): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1367): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1377): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1379): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1382): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1406): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1407): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1408): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1432): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1448): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1465): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1475): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1526): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1631): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1637): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1640): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding
-(1641): Line appears to be too long, but this could be caused by non-ascii characters in UTF-8 encoding

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == There are 53 instances of lines with non-ascii characters in the
     document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 40
     longer pages, the longest (page 2) being 63 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 40 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 11 instances of too long lines in the document, the longest
     one being 7 characters in excess of 72.

  == There are 15 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 555 has weird spacing: '...eful to  remem...'

  == Line 759 has weird spacing: '... regexp  repla...'

  == Line 1829 has weird spacing: '...for the  purpo...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (31 July 2003) is 7574 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'MD5' is defined on line 1707, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI'

  ** Obsolete normative reference: RFC  822 (ref. 'EMHDRS') (Obsoleted by RFC
     2822)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HKMP'

  ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     'MD5')

  ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC
     3401, RFC 3402, RFC 3403, RFC 3404)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'REG'

  ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC
     3986)

  ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref.
     'URNBIB')

  ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC
     8141)

  ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC
     3406)


     Summary: 14 errors (**), 0 flaws (~~), 10 warnings (==), 10 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet-Draft: draft-kunze-ark-06.txt                          J. Kunze
3	ARK Identifier Scheme                    University of California (UCOP)
4	Expires 31 January 2004                                 R. P. C. Rodgers
5	                                         US National Library of Medicine
6	                                                            31 July 2003

8	                  The ARK Persistent Identifier Scheme

10	      (http://www.ietf.org/internet-drafts/draft-kunze-ark-06.txt)

12	Status of this Document

14	   This document is an Internet-Draft and is in full conformance with
15	   all provisions of Section 10 of RFC2026.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups.  Note that
19	   other groups may also distribute working documents as Internet-
20	   Drafts.

22	   Internet-Drafts are draft documents valid for a maximum of six months
23	   and may be updated, replaced, or obsoleted by other documents at any
24	   time.  It is inappropriate to use Internet-Drafts as reference
25	   material or to cite them other than as ``work in progress.''

27	   The list of current Internet-Drafts can be accessed at
28	   http://www.ietf.org/ietf/1id-abstracts.txt

30	   The list of Internet-Draft Shadow Directories can be accessed at
31	   http://www.ietf.org/shadow.html.

33	   Distribution of this document is unlimited.  Please send comments to
34	   jak@ucop.edu.

36	   Copyright (C) The Internet Society (2003).  All Rights Reserved.

38	Abstract

40	   The ARK (Archival Resource Key) is a scheme intended to facilitate
41	   the persistent naming and retrieval of information objects.  It
42	   comprises an identifier syntax and three services.  An ARK has four
43	   components:

45	                    [http://NMAH/]ark:/NAAN/Name

47	   an optional and mutable Name Mapping Authority Hostport part (NMAH,
48	   where "hostport" is a hostname followed optionally by a colon and
49	   port number), the "ark:" label, the Name Assigning Authority Number
50	   (NAAN), and the assigned Name.  The NAAN and Name together form the
51	   immutable persistent identifier for the object.

53	   An ARK request is an ARK with a service request and a question mark
54	   appended to it.  Use of an ARK request proceeds in two steps.  First,
55	   the NMAH, if not specified, is discovered based on the NAAN.  Two
56	   discovery methods are proposed:  one is file based, the other based
57	   on the DNS NAPTR record.  Second, the ARK request is submitted to the
58	   NMAH.  Three ARK services are defined, gaining access to:  (1) the
59	   object (or a sensible substitute), (2) a description of the object
60	   (metadata), and (3) a description of the commitment made by the NMA
61	   regarding the persistence of the object (policy).  These services are
62	   defined initially to use the HTTP protocol.  When the NMAH is
63	   specified, the ARK is a valid URL that can gain access to ARK
64	   services using an unmodified Web client.

66	1.  Introduction

68	   This document describes a scheme for the high-quality naming of
69	   information resources.  The scheme, called the Archival Resource Key
70	   (ARK), is well suited to long-term access and identification for any
71	   information resources that accommodate reasonably regular electronic
72	   description.  This includes digital documents, databases, software,
73	   and websites, as well as physical objects (such as books, bones, and
74	   statues) and intangible objects (chemicals, diseases, vocabulary
75	   terms, performances).  Hereafter the term "object" refers to an
76	   information resource.  The term ARK itself refers both to the scheme
77	   and to any single identifier that conforms to it.

79	   Schemes for persistent identification of network-accessible objects
80	   are not new.  In the early 1990s, the design of the Uniform Resource
81	   Name [URNSYN] responded to the observed failure rate of URLs by
82	   articulating an indirect, non-hostname-based naming scheme and the
83	   need for responsible name management.  Meanwhile, promoters of the
84	   Digital Object Identifier [DOI] succeeded in building a community of
85	   providers around a mature software system that supports name
86	   management.  The Persistent Uniform Resource Locator [PURL] is a
87	   third scheme that has the unique advantage of working with unmodified
88	   web browsers.  The ARK scheme is a new approach.

90	   A founding principle of the ARK is that persistence is purely a
91	   matter of service.  Persistence is neither inherent in an object nor
92	   conferred on it by a particular naming syntax.  Rather, persistence
93	   is achieved through a provider's successful stewardship of objects
94	   and their identifiers.  The highest level of persistence will be
95	   reinforced by a provider's robust contingency, redundancy, and
96	   succession strategies.  It is further safeguarded to the extent that
97	   a provider's mission is shielded from marketplace and political
98	   instabilities.

100	1.1.  Three Reasons to Use ARKs

102	   The first requirement of an ARK is to give users a link from an
103	   object to a promise of stewardship for it.  That promise is a multi-
104	   faceted covenant that binds the word of an identified service
105	   provider to a specific set of responsibilities.  No one can tell if
106	   successful stewardship will take place because no one can predict the
107	   future.  Reasonable conjecture, however, may be based on past
108	   performance.  There must be a way to tie a promise of persistence to
109	   a provider's demonstrated or perceived ability -- its reputation --
110	   in that arena.  Provider reputations would then rise and fall as
111	   promises are observed variously to be kept and broken.  This is
112	   perhaps the best way we have for gauging the strength of any
113	   persistence promise.

115	   The second requirement of an ARK is to give users a link from an
116	   object to a description of it.  The problem with a naked identifier
117	   is that without a description real identification is incomplete.
118	   Identifiers common today are relatively opaque, though some contain
119	   ad hoc clues that reflect fleeting life cycle events such as the
120	   address of a short stay in a filesystem hierarchy.  Possession of
121	   both an identifier and an object is some improvement, but positive
122	   identification may still be elusive since the object itself need not
123	   include a matching identifier or be transparent enough to reveal its
124	   identity without significant research.  In either case, what is
125	   called for is a record bearing witness to the identifier's
126	   association with the object, as supported by a recorded set of object
127	   characteristics.  This descriptive record is partly an identification
128	   "receipt" with which users and archivists can verify an object's
129	   identity after brief inspection and a plausible match with recorded
130	   characteristics such as title and size.

132	   The final requirement of an ARK is to give users a link to the object
133	   itself (or to a copy) if at all possible.  Persistent access is the
134	   central duty of an ARK, with persistent identification playing a
135	   vital but supporting role.  Object access may not be feasible for
136	   various reasons, such as catastrophic loss of the object, a licensing
137	   agreement that keeps an archive "dark" for a period of years, or when
138	   an object's own lack of tangible existence precludes normal concepts
139	   of access (e.g., a vocabulary term might be accessed through its
140	   definition).  In such cases the ARK's identification role assumes a
141	   much higher profile.  But attempts to simplify the persistence
142	   problem by decoupling access from identification and concentrating
143	   exclusively on the latter are of questionable utility.  A perfect
144	   system for assigning forever unique identifiers might be created, but
145	   if it did so without reducing access failure rates, no one would be
146	   interested.  The central issue -- which may be summed up as the "HTTP
147	   404 Not Found" problem -- would not have been addressed.

149	1.2.  Organizing Support for ARKs

151	   Co-location of persistent access and identification services is
152	   natural.  Any organization that undertakes ongoing support of true
153	   persistent identification (which includes description) is well-served
154	   if it controls, owns, or otherwise has clear internal access to the
155	   identified objects, and this gives it an advantage if it wishes also
156	   to support persistent external access.  Conversely, the latter
157	   implies a commitment to collection management activities such as
158	   monitoring, acquisition, verification, and change control over
159	   objects that are persistently identified at least for the sake of
160	   internal record keeping and accountability; this covers the major
161	   prerequisite for external support of persistent identification.
162	   Organizing ARK services under one roof thus tends to make sense.

164	   ARK support is not for everybody.  By requiring specific, revealed
165	   commitments to preservation, object access, and description, the bar
166	   for providing ARK services is high.  On the other hand, it would be
167	   hard to grant credence to a persistence promise from an organization
168	   that could not muster the minimum ARK services.  Still, there may be
169	   a business model for an ARK-like, description-only service built on
170	   top of another organization's full complement of ARK services.  For
171	   example, there might be competition at the description level for
172	   abstracting and indexing a body of scientific literature archived in
173	   a combination of open and fee-based repositories.  Such a business
174	   would benefit more from persistence than it would directly support
175	   it.

177	1.3.  A Definition of Identifier

179	   Heretofore, persistence discussion has been hampered by a borrowed
180	   meaning for "identifier" that emerged as a side effect of defining
181	   the Uniform Resource Identifier in [URI]:

183	        (formerly)  An identifier is a sequence of characters with a
184	        restricted syntax ... that can act as a reference to something
185	        that has identity.

187	   The term works in context, but falters when employed for persistence.
188	   Troubling phrases arise, such as,

190	        "The goal is to create an identifier that does not break."

192	   As defined, this kind of identifier "breaks" when it sustains damage
193	   to its character sequence, but really what breaks has to do with the
194	   identifier's reference role.  The following definition is proposed.

196	        (new definition)  An identifier is an association between a
197	        string (a sequence of characters) and an information resource.
198	        That association is made manifest by a record (e.g., a
199	        cataloging or other metadata record) that binds the identifier
200	        string to a set of identifying resource characteristics.

202	   The identifier (the association) must be vouched for by some sort of
203	   record.  In the complete absence of any testimony (e.g., metadata)
204	   regarding an association, a would-be identifier string is a
205	   meaningless sequence of characters.  To keep an externally visible
206	   but otherwise internal identifier string opaque to outsiders, for
207	   example, it suffices for an organization not to disclose the nature
208	   of its association.  For our immediate purpose, actual existence of
209	   an association record is more important than its authenticity.  If
210	   one is lucky, an object carries its own identifier as part of itself
211	   (e.g., imprinted on the first page), but in processes such as
212	   resource discovery and retrieval the typical object is often unwieldy
213	   or unavailable (such as when licensing restrictions are in effect).
214	   A metadata record that includes the identifier string is the next
215	   best thing -- a conveniently manipulable surrogate that can act as
216	   both an association "receipt" and "declaration".

218	   It now makes sense to speak of preventing an identifier, as an
219	   association, from breaking.  Having said that, this document still
220	   (ab)uses the terms "ARK" and "identifier" as shorthands to refer to
221	   identifier strings, in other words, to sequences of characters.  Thus
222	   a discussion of ARK syntax refers to a string format, not an
223	   association format.  The context should make the meaning clear.

225	2.  ARK Anatomy

227	   An ARK is represented by a sequence of characters (a string) that
228	   contains the label, "ark:", optionally preceded by the beginning part
229	   of a URL.  Here is a diagrammed example.

231	               http://foobar.zaf.org/ark:/12025/654xz321
232	               \___________________/ \__/ \___/ \______/
233	                 (replaceable)        |     |      |
234	                      |         ARK Label   |    Name (assigned by the NAA)
235	                      |                     |
236	        Name Mapping Authority             Name Assigning Authority
237	               Hostport (NMAH)              Number (NAAN)

239	   The ARK syntax can be summarized,

241	                    [http://NMAH/]ark:/NAAN/Name

243	   where the NMAH part is in brackets to indicate that it is temporary,
244	   replaceable, and optional.

246	2.1.  The Name Mapping Authority Hostport (NMAH)

248	   Before the "ark:" label may appear an optional Name Mapping Authority
249	   Hostport (NMAH) that is a temporary address where ARK service
250	   requests may be sent.  It consists of "http://" (or any service
251	   specification valid for a URL) followed by an Internet hostname or
252	   hostport combination having the same format and semantics as the
253	   hostport part of a URL.  The most important thing about the NMAH is
254	   that it is "identity inert" from the point of view of object
255	   identification.  In other words, ARKs that differ only in the
256	   optional NMAH part identify the same object.  Thus, for example, the
257	   following three ARKs are synonyms for but one information resource:

259	               http://foobar.zaf.org/ark:/12025/654xz321
260	             http://sneezy.dopey.com/ark:/12025/654xz321
261	                                     ark:/12025/654xz321

263	   The NMAH part makes an ARK into an actionable URL.  Conversely, any
264	   URL whose path component begins with "ark:/" stands a reasonable
265	   chance of being an ARK (only because such URLs are not common), but
266	   further verification is still required (such as probing the URL for
267	   the three ARK services).
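
   As a rough illustration of the first check, the sketch below tests
   whether a URL's path begins with the "ark:" label.  The function
   name is hypothetical, and a positive result only marks a candidate;
   the sketch does not probe for the three ARK services.

```python
from urllib.parse import urlsplit


def looks_like_ark_url(url):
    """Return True if the URL's path starts with the "ark:/" label."""
    # urlsplit() keeps the leading slash on the path component.
    return urlsplit(url).path.startswith("/ark:/")
```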

269	   The NMAH part is temporary, disposable, and replaceable.  Over time
270	   the NMAH will likely stop working and have to be replaced with a
271	   currently active service provider.  This relies on a mapping authority
272	   discovery process, of which two alternate methods are outlined in a
273	   later section.  Meanwhile, a carefully chosen NMAH can be as durable
274	   as any Internet domain name, and so may last for a decade or longer.
275	   Users should be prepared, however, to refresh the NMAH because the
276	   one found in the URL form of the ARK may have stopped working.

278	   The above method for creating an actionable identifier from a basic
279	   ARK (prepending "http://" and an NMAH) is itself temporary.  Assuming
280	   that the reign of [HTTP] in information retrieval will end one day,
281	   ARKs will have to be converted into new kinds of actionable
282	   identifiers.  In any event, if ARKs see widespread use, web browsers
283	   would presumably evolve to perform this (currently simple)
284	   transformation automatically.

286	2.2.  The Name Assigning Authority Number (NAAN)

288	   The part of the ARK directly following the "ark:" is the Name
289	   Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
290	   This part is always required, as it identifies the organization that
291	   originally assigned the Name of the object.  It is used to discover a
292	   currently valid NMAH and to provide top-level partitioning of the
293	   space of all ARKs.  NAANs are registered in a manner similar to URN
294	   Namespaces, but they are pure numbers consisting of 5 digits or 9
295	   digits.  Thus, the first 100,000 registered NAAs fit compactly into
296	   the 5 digits, and if growth warrants, the next billion fit into the 9
297	   digit form.  In either case the fixed odd number of digits helps
298	   reduce the chances of finding a NAAN out of context and confusing it
299	   with nearby quantities such as 4-digit dates.
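
   The anatomy described so far can be sketched as a small parser.  The
   pattern and the name parse_ark are illustrative assumptions, not
   part of this specification.  The pattern accepts any URL scheme
   before the NMAH and a NAAN of exactly 5 or 9 digits; it does not
   check the characters of the Name.

```python
import re

# Hypothetical pattern for [scheme://NMAH/]ark:/NAAN/Name.
ARK_PATTERN = re.compile(
    r"^(?:[A-Za-z][A-Za-z0-9+.-]*://(?P<nmah>[^/]+)/)?"
    r"ark:/(?P<naan>\d{5}|\d{9})/(?P<name>.+)$"
)


def parse_ark(ark):
    """Split an ARK string into its parts, or return None if it does not match."""
    match = ARK_PATTERN.match(ark)
    if match is None:
        return None
    return {
        "nmah": match.group("nmah"),  # None when the ARK has no NMAH part
        "naan": match.group("naan"),
        "name": match.group("name"),
    }
```

   Parsing an ARK with and without an NMAH yields the same NAAN and
   Name, which reflects the NMAH being "identity inert".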

301	2.3.  The Name Part

303	   The final part of the ARK is the Name assigned by the NAA, and it is
304	   also required.  The Name is a string of visible ASCII characters and
305	   should be less than 128 bytes in length.  The length restriction
306	   keeps the ARK short enough to append ordinary ARK request strings
307	   without running into transport restrictions within HTTP GET requests.
308	   Characters may be letters, digits, or any of these seven characters:

310	         =   @   $   _   *   +   #

312	   The following characters may also be used, but in limited ways:

314	         /   .   -   %

316	   The characters `/' and `.' are ignored if either appears as the last
317	   character of an ARK.  If used internally, they allow a name assigning
318	   authority to reveal object hierarchy and object variants as described
319	   in the next two sections.

321	   A `-' (hyphen) may appear in an ARK, but must be ignored in lexical
322	   comparisons.  The `%' character is reserved for %-encoding all other
323	   octets that would appear in the ARK string, in the same manner as for
324	   URIs [URI].  A %-encoded octet consists of a `%' followed by two hex
325	   digits; for example, "%7d" stands in for `}'.  Lower case hex digits
326	   are preferred to reduce the chances of false acronym recognition;
327	   thus it is better to use "%acT" instead of "%ACT".  The character `%'
328	   itself must be represented using "%25".  As with URNs, %-encoding
329	   permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI)
330	   that have less restricted character repertoires [URNBIB].
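
   A minimal sketch of %-encoding under these rules follows.  The name
   encode_name is hypothetical, and treating the input as UTF-8 is an
   assumption of the sketch, since this section speaks only of octets.
   The sketch leaves `/', `.', and `-' untouched and does not enforce
   the recommended length limit.

```python
import string

# Characters allowed unencoded in the Name part: letters, digits, the
# punctuation characters listed above, and the limited-use '/', '.', '-'.
UNENCODED = set(string.ascii_letters + string.digits + "=@$_*+#" + "/.-")


def encode_name(raw):
    """%-encode every octet outside the permitted repertoire, in lower-case hex."""
    out = []
    for octet in raw.encode("utf-8"):  # UTF-8 is an assumption of this sketch
        ch = chr(octet)
        if octet < 128 and ch in UNENCODED:
            out.append(ch)
        else:
            # '%' itself is not in UNENCODED, so it becomes "%25".
            out.append("%{:02x}".format(octet))
    return "".join(out)
```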

332	   The creation of names that include linguistically based constructs
333	   (having recognizable meaning from natural language) is strongly
334	   discouraged if long-term persistence is a naming priority.  Such names
335	   do not age or travel well.  Names that look more or less like numbers
336	   avoid common problems that defeat persistence and international
337	   acceptance.  The use of digits is highly recommended.  Mixing in non-
338	   vowel alphabetic characters is a relatively safe and easy way to
339	   achieve more compact names, although any character repertoire can
340	   work if potentially troublesome names will be discarded during a
341	   screening process.  More on naming considerations is given in a later
342	   section.

344	2.3.1.  Names that Reveal Object Hierarchy

346	   A name assigning authority may choose to reveal the presence of a
347	   hierarchical relationship between objects using the `/' (slash)
348	   character in the Name part of an ARK.  If the Name contains an
349	   internal slash, the piece to its left indicates a containing object.
350	   For example, publishing an ARK of the form,

352	                         ark:/12025/654/xz/321

354	   is equivalent to publishing three ARKs,

356	                         ark:/12025/654/xz/321
357	                         ark:/12025/654/xz
358	                         ark:/12025/654

360	   together with a declaration that the first object is contained in the
361	   second object, and that the second object is contained in the third.

363	   Revealing the presence of hierarchy is completely up to the assigning
364	   authority.  It is hard enough to commit to one object's name, let
365	   alone to three objects' names and to a specific, ongoing relatedness
366	   among them.  Thus, regardless of whether hierarchy was present
367	   initially, the assigning authority, by not using slashes, reveals no
368	   shared inferences about hierarchical or other inter-relatedness in
369	   the following ARKs:

371	                         ark:/12025/654_xz_321
372	                         ark:/12025/654_xz
373	                         ark:/12025/654xz321
374	                         ark:/12025/654xz
375	                         ark:/12025/654

377	   Note that slashes around the ARK's NAAN (/12025/ in these examples)
378	   are not part of the ARK's Name and therefore do not indicate the
379	   existence of some sort of NAAN super object containing all objects in
380	   its namespace.  A slash must have at least one non-structural
381	   character (one that is neither a slash nor a period) on both sides in
382	   order for it to separate recognizable structural components.  So
383	   initial or final slashes may be removed, and double slashes may be
384	   converted into single slashes.
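
   The hierarchy rules of this section can be sketched as follows.  The
   name containing_arks is hypothetical, and the sketch assumes the
   NAAN and Name have already been separated.  It discards empty pieces
   so that initial, final, and doubled slashes disappear, but it does
   not treat periods specially.

```python
def containing_arks(naan, name):
    """List the ARK itself, followed by the ARK of each containing object."""
    # Drop empty pieces so leading, trailing, and doubled slashes vanish.
    pieces = [piece for piece in name.split("/") if piece]
    return [
        "ark:/{}/{}".format(naan, "/".join(pieces[:count]))
        for count in range(len(pieces), 0, -1)
    ]
```

   A Name without slashes yields a single ARK, so no containment is
   implied.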

386	2.3.2.  Names that Reveal Object Variants

388	   A name assigning authority may choose to reveal the possible presence
389	   of variant objects using the `.' (period) character in the Name part
390	   of an ARK.  If the Name contains an internal period, the piece to its
391	   left is a base name and the piece to its right up to the end of the
392	   ARK or to the next period is a suffix.  A Name may have more than one
393	   suffix, for example,

395	                         ark:/12025/654.24
396	                         ark:/12025/xz4/654.24
397	                         ark:/12025/654.f55.g78.v20

399	   There are two main rules.  First, if two ARKs share the same base
400	   name but have different suffixes, the corresponding objects were
401	   considered variants of each other (different formats, languages,
402	   versions, etc.) by the assigning authority.  Thus, the following ARKs
403	   are variants of each other:

405	                         ark:/12025/654.f55.g78.v20
406	                         ark:/12025/654.321xz
407	                         ark:/12025/654.44

409	   Second, publishing an ARK with a suffix implies the existence of at
410	   least one variant identified by the ARK without its suffix.  The ARK
411	   otherwise permits no further assumptions about what variants might
412	   exist.  So publishing the ARK,

414	                         ark:/12025/654.f55.g78.v20

416	   is equivalent to publishing the four ARKs,
417	                         ark:/12025/654.f55.g78.v20
418	                         ark:/12025/654.f55.g78
419	                         ark:/12025/654.f55
420	                         ark:/12025/654

422	   Revealing the possibility of variants is completely up to the
423	   assigning authority.  It is hard enough to commit to one object's
424	   name, let alone to multiple variants' names and to a specific, ongoing
425	   relatedness among them.  The assigning authority is the sole arbiter
426	   of what constitutes a variant within its namespace, and whether to
427	   reveal that kind of relatedness by using periods within its names.

429	   A period must have at least one non-structural character (one that is
430	   neither a slash nor a period) on both sides in order for it to
431	   separate recognizable structural components.  So initial or final
432	   periods may be removed, and double periods may be converted into
433	   single periods.  Multiple suffixes should be arranged in sorted order
434	   (pure ASCII collating sequence) at the end of an ARK.

436	2.3.3.  Hyphens are Ignored

438	   Hyphens are always ignored in ARKs.  Hyphens may be added to an ARK's
439	   Name part for readability, or during the formatting and wrapping of
440	   text lines, but (as in phone numbers) they are treated as if they
441	   were not present.  Thus, like the NMAH, hyphens are "identity inert"
442	   in comparing ARKs for equivalence.  For example, the following ARKs
443	   are equivalent for purposes of comparison and ARK service access:

445	                                    ark:/12025/65-4-xz-321
446	                    ark:sneezy.dopey.com/12025/654--xz32-1
447	                                    ark:/12025/654xz321

449	2.4.  Normalization and Lexical Equivalence

451	   To determine if two or more ARKs identify the same object, the ARKs
452	   are compared for lexical equivalence after first being normalized.
453	   Since ARK strings may appear in various forms (e.g., having different
454	   NMAHs), normalizing them minimizes the chances that comparing two ARK
455	   strings for equality will fail unless they actually identify
456	   different objects.  In a specified-host ARK (one having an NMAH), the
457	   NMAH never participates in such comparisons.

459	   Normalization of an ARK for the purpose of octet-by-octet equality
460	   comparison with another ARK consists of four steps.  First, any upper
461	   case letters in the "ark:" label and the two characters following a
462	   `%' are converted to lower case.  The case of all other letters in
463	   the ARK string must be preserved.  Second, any NMAH part is removed
464	   (everything from an initial "http://" up to the next slash) and all
465	   hyphens are removed.

467	   Third, structural characters (slash and period) are normalized.
468	   Initial and final occurrences are removed, and two structural
469	   characters in a row (e.g., // or ./) are replaced by the first
470	   character, iterating until each occurrence has at least one
471	   non-structural character on both sides.  Also, if there are any
472	   components with a period on the left and a slash on the right, either
473	   the component and the preceding period must be moved to the end of
474	   the Name part or the ARK must be thrown out as malformed.

476	   The fourth and final step is to arrange the suffixes in ASCII
477	   collating sequence (that is, to sort them) and to remove duplicate
478	   suffixes, if any.  It is also permissible to throw out ARKs for which
479	   the suffixes are not sorted.

481	   The resulting ARK string is now normalized.  Comparisons between
482	   normalized ARKs are case-sensitive, meaning that upper case letters
483	   are considered different from their lower case counterparts.
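
   The four normalization steps might be sketched in Python as follows.
   This is a non-authoritative illustration under stated assumptions:
   the function name normalize_ark is hypothetical; input is assumed to
   have the form "[http://NMAH/]ark:/NAAN/Name"; the result is re-emitted
   with the "ark:/" prefix; and, of the two permitted treatments of a
   suffix that is followed by a slash, the sketch chooses to reject the
   ARK as malformed.

```python
import re


def normalize_ark(ark):
    """Return the normalized form of an ARK, or raise ValueError."""
    # Step 2 (first half): drop an NMAH such as "http://host:port/".
    s = re.sub(r"^http://[^/]*/", "", ark)
    # Step 1: accept the label in any case; it is re-emitted as "ark:".
    label = re.match(r"ark:", s, flags=re.IGNORECASE)
    if label is None:
        raise ValueError("no ark: label in " + repr(ark))
    rest = s[label.end():]
    # Step 1: lower-case the two characters that follow each '%'.
    rest = re.sub(r"%(..)", lambda m: "%" + m.group(1).lower(), rest)
    # Step 2 (second half): hyphens are identity inert.
    rest = rest.replace("-", "")
    # Step 3: collapse each run of structural characters to its first
    # character, then trim structural characters from both ends.
    rest = re.sub(r"([/.])[/.]+", r"\1", rest).strip("/.")
    # Step 3: a component with a period on its left and a slash on its
    # right is treated as malformed (the stricter permitted option).
    if re.search(r"\.[^/.]*/", rest):
        raise ValueError("suffix followed by a slash in " + repr(ark))
    # Step 4: sort the suffixes (code point order equals ASCII order
    # for ASCII strings) and drop duplicates.
    head, slash, last = rest.rpartition("/")
    base, *suffixes = last.split(".")
    return "ark:/" + head + slash + ".".join([base] + sorted(set(suffixes)))


print(normalize_ark("http://ark.nlm.nih.gov/ARK:/12025/65-4-xz-321"))
print(normalize_ark("ark:/12025/654.v20.f55.g78.f55"))
```

   Two ARKs then identify the same object, for purposes of this
   comparison, exactly when their normalized strings are equal octet
   by octet.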

485	   To keep ARK string variation to a minimum, no reserved ARK characters
486	   should be %-encoded unless the aim is deliberately to conceal their
487	   reserved meanings.  No non-reserved ARK characters should ever be
488	   %-encoded.  Finally, no %-encoded character should ever appear in an
489	   ARK in its decoded form.

491	2.5.  Naming Considerations

493	   The ARK has different goals from the URI, so it has different
494	   character set requirements.  Because linguistic constructs imperil
495	   persistence, non-ASCII character support is unimportant for ARKs.
496	   ARKs and URIs share goals of transcribability and transportability
497	   within web documents, so characters are required to be visible, non-
498	   conflicting with HTML/XML syntax, and not subject to tampering during
499	   transmission across common transport gateways.  Add the goal of
500	   making an undelimited ARK recognizable in running prose, as in
501	   ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
502	   period) end up being excluded from the ARK lest the end of a phrase
503	   or sentence be mistaken for part of the ARK.

505	   A valuable technique for provision of persistent objects is to try to
506	   arrange for the complete identifier to appear on, with, or near its
507	   retrieved object.  An object encountered at a moment in time when its
508	   discovery context has long since disappeared could then easily be
509	   traced back to its metadata, to alternate versions, to updates, etc.
510	   This has seen reasonable success, for example, in book publishing and
511	   software distribution.

513	   If persistence is the goal, a deliberate local strategy for
514	   systematic name assignment is crucial.  Names must be chosen with
515	   great care.  Poorly chosen and managed names will devastate any
516	   persistence strategy, and they do not discriminate based on naming
517	   scheme.  Whether a mistakenly re-assigned identifier is a URN, DOI,
518	   PURL, URL, or ARK, the damage -- failed access and confusion -- is
519	   not mitigated more in one scheme than in another.  Conversely, in-
520	   house efforts to manage names responsibly will go much further
521	   towards safeguarding persistence than any choice of naming scheme or
522	   name resolution technology.

524	   Hostnames appearing in any identifier meant to be persistent must be
525	   chosen with extra care.  The tendency in hostname selection has
526	   traditionally been to choose a token with recognizable attributes,
527	   such as a corporate brand, but that tendency wreaks havoc with
528	   persistence that is to outlive brands, corporations, subject
529	   classifications, and natural language semantics (e.g., what did the
530	   three letters "gay" mean in 1958, 1978, and 1998?).  Today's recognized
531	   and correct attributes are tomorrow's stale or incorrect attributes.
532	   In making hostnames (any names, actually) long-term persistent, it
533	   helps to eliminate recognizable attributes to the extent possible.
534	   This affects selection of any name based on URLs, including PURLs and
535	   the explicitly disposable NMAHs.  There is no excuse for a provider
536	   that manages its internal names impeccably not to exercise the same
537	   care in choosing what could be an exceptionally durable hostname,
538	   especially if it would form the prefix for all the provider's URL-
539	   based external names.  Registering an opaque hostname in the ".org"
540	   or ".net" domain would not be a bad start.

542	   Dubious persistence speculation does not make selecting naming
543	   strategies any easier.  For example, despite rumors to the contrary,
544	   there are really no obvious reasons why the organizations registering
545	   DNS names, URN Namespaces, and DOI publisher IDs should have among
546	   them one that is intrinsically more fallible than the next.
547	   Moreover, it is a misconception that the demise of DNS and of HTTP
548	   need adversely affect the persistence of URLs.  At such a time,
549	   certainly URLs from the present day might not then be actionable by
550	   our present-day mechanisms, but resolution systems for future non-
551	   actionable URLs are no harder to imagine than resolution systems for
552	   present-day non-actionable URNs and DOIs.  There is no more stable a
553	   namespace than one that is dead and frozen, and that would then
554	   characterize the space of names bearing the "http://" prefix.  It is
555	   useful to remember that just because hostnames have been carelessly
556	   chosen in their brief history does not mean that they are unsuitable
557	   in NMAHs (and URLs) intended for use in situations demanding the
558	   highest level of persistence available in the Internet environment.
559	   A well-planned name assignment strategy is everything.

561	3.  Assigners of ARKs

563	   A Name Assigning Authority (NAA) is an organization that creates (or
564	   delegates creation of) long-term associations between identifiers and
565	   information objects.  Examples of NAAs include national libraries,
566	   national archives, and publishers.  An NAA may arrange with an
567	   external organization for identifier assignment.  The US Library of
568	   Congress, for example, allows OCLC (the Online Computer Library
569	   Center, a major world cataloger of books) to create associations
570	   between Library of Congress Control Numbers (LCCNs) and the books that
571	   OCLC processes.  A cataloging record is generated that testifies to
572	   each association, and the identifier is included by the publisher,
573	   for example, in the front matter of a book.

575	   An NAA does not so much create an identifier as create an
576	   association.  The NAA first draws an unused identifier string from
577	   its namespace, which is the set of all identifiers under its control.
578	   It then records the assignment of the identifier to an information
579	   object having sundry witnessed characteristics, such as a particular
580	   author and modification date.  A namespace is usually reserved for an
581	   NAA by agreement with recognized community organizations (such as
582	   IANA and ISO) that all names containing a particular string be under
583	   its control.  In the ARK an NAA is represented by the Name Assigning
584	   Authority Number (NAAN).

586	   The ARK namespace reserved for an NAA is the set of names bearing its
587	   particular NAAN.  For example, all strings beginning with
588	   "ark:/12025/" are under control of the NAA registered under 12025,
589	   which might be the National Library of Finland.  Because each NAA has
590	   a different NAAN, names from one namespace cannot conflict with those
591	   from another.  Each NAA is free to assign names from its namespace
592	   (or delegate assignment) according to its own policies.  These
593	   policies must be documented in a manner similar to the declarations
594	   required for URN Namespace registration [URNNID].

596	   For now, registration of ARK NAAs is in a bootstrapping phase.  To
597	   register, please read about the mapping authority discovery file in
598	   the next section and send email to jak@ucop.edu.

600	4.  Finding a Name Mapping Authority

602	   In order to derive an actionable identifier (these days, a URL) from
603	   an ARK, a hostport (hostname or hostname plus port combination) for a
604	   working Name Mapping Authority (NMA) must be found.  An NMA is a
605	   service that is able to respond to the three basic ARK service
606	   requests.  Relying on registration and client-side discovery, NMAs
607	   make known which NAAs' identifiers they are willing to service.

609	   Upon encountering an ARK, a user (or client software) looks inside it
610	   for the optional NMAH part (the hostport of the NMA's ARK service).
611	   If it contains an NMAH that is working, this NMAH discovery step may
612	   be skipped; the NMAH effectively uses the beginning of an ARK to
613	   cache the results of a prior mapping authority discovery process.  If
614	   a new NMAH needs to be found, the client looks inside the ARK again for
615	   the NAAN (Name Assigning Authority Number).  Querying a global
616	   database, it then uses the NAAN to look up all current NMAHs that
617	   service ARKs issued by the identified NAA.  The global database is
618	   key, and two specific methods for querying it are given in this
619	   section.
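
   The first part of this process, looking inside the ARK for its NMAH
   and NAAN, might look like the following Python sketch.  It is
   illustrative only.  The name split_ark and the regular expression are
   assumptions, and the sketch accepts only strings of the form
   "[http://NMAH/]ark:/NAAN/Name".

```python
import re

# Optional "http://NMAH/" prefix, then "ark:/", the NAAN, and the Name.
ARK_RE = re.compile(
    r"^(?:http://(?P<nmah>[^/]+)/)?ark:/(?P<naan>[^/]+)/(?P<name>.+)$"
)


def split_ark(ark):
    """Return (nmah, naan, name); nmah is None when no NMAH is present."""
    m = ARK_RE.match(ark)
    if m is None:
        raise ValueError("not a recognizable ARK: " + repr(ark))
    return m.group("nmah"), m.group("naan"), m.group("name")


print(split_ark("http://ark.nlm.nih.gov/ark:/12025/psbbantu"))
print(split_ark("ark:/12025/654.24"))
```

   A client would try the returned NMAH first, if any, and fall back to
   a lookup keyed on the NAAN when the NMAH is absent or not working.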

621	   In the interests of long-term persistence, however, ARK mechanisms
622	   are first defined in high-level, protocol-independent terms so that
623	   mechanisms may evolve and be replaced over time without compromising
624	   fundamental service objectives.  Either or both specific methods
625	   given here may eventually be supplanted by better methods since, by
626	   design, the ARK scheme does not depend on a particular method, but
627	   only on having some method to locate an active NMAH.

629	   At the time of issuance, at least one NMAH for an ARK should be
630	   prepared to service it.  That NMA may or may not be administered by
631	   the Name Assigning Authority (NAA) that created it.  Consider the
632	   following hypothetical example of providing long-term access to a
633	   cancer research journal.  The publisher wishes to turn a profit and
634	   the National Library of Medicine wishes to preserve the scholarly
635	   record.  An agreement might be struck whereby the publisher would act
636	   as the NAA and the national library would archive the journal issue
637	   when it appears, but without providing direct access for the first
638	   six months.  During the first six months of peak commercial
639	   viability, the publisher would retain exclusive delivery rights and
640	   would charge access fees.  Again, by agreement, both the library and
641	   the publisher would act as NMAs, but during that initial period the
642	   library would redirect requests for issues less than six months old
643	   to the publisher.  At the end of the waiting period, the library
644	   would then begin servicing requests for issues older than six months
645	   by tapping directly into its own archives.  Meanwhile, the publisher
646	   might routinely redirect incoming requests for older issues to the
647	   library.  Long-term access is thereby preserved, and so is the
648	   commercial incentive to publish content.

650	   There is never a requirement that an NAA also run an NMA service,
651	   although it seems not an unlikely scenario.  Over time NAAs and NMAs
652	   would come and go.  One NMA would succeed another, and there might be
653	   many NMAs serving the same ARKs simultaneously (e.g., as mirrors or
654	   as competitors).  There might also be asymmetric but coordinated NMAs
655	   as in the library-publisher example above.

657	4.1.  Looking Up NMAHs in a Globally Accessible File

659	   This subsection describes a way to look up NMAHs using a simple text
660	   file.  For efficient access the file may be stored in a local
661	   filesystem, but it needs to be reloaded periodically to incorporate
662	   updates.  It is not expected that the size of the file or frequency
663	   of update should impose an undue maintenance or searching burden any
664	   time soon, for even primitive linear search of a file with ten-
665	   thousand NAAs is a subsecond operation on modern server machines.
666	   The proposed file strategy is similar to the /etc/hosts file strategy
667	   that supported Internet host address lookup for a period of years
668	   before the advent of the Domain Name System [DNS].

670	   A copy of the current file (at the time of writing) appears in an
671	   appendix and is available on the web.  A minimal version of the file
672	   appears below.  Comment lines (lines that begin with `#') explain the
673	   format and give the file's modification time, reloading address, and
674	   NAA registration instructions.  There is even a Perl script that
675	   processes the file embedded in the file's comments.  Because this is
676	   still a proposed file, none of the values in it are real.

678	         #
679	         # Name Assigning Authority / Name Mapping Authority Lookup Table
680	         #       Last change:   31 July 2003
681	         #       Reload from:   http://ark.nlm.nih.gov/etc/natab
682	         #       Mirrored at:   http://ark.cdlib.org/natab
683	         #       To register:   mailto:jak@ucop.edu?Subject=naareg
684	         #       Process with:  Perl script at end of this file (optional)
685	         #
686	         # Each NAA appears at the beginning of a line with the NAA Number
687	         # first, a colon, and an ARK or URL to a statement of naming policy
688	         # (see http://ark.cdlib.org for an example).
689	         # All the NMA hostports that service an NAA are listed, one per
690	         # line, indented, after the corresponding NAA line.
691	         #
692	         #       National Library of Medicine
693	         12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
694	                 ark.nlm.nih.gov USNLM
695	                 foobar.zaf.org UCSF
696	                 sneezy.dopey.com BIREME
697	         #
698	         #       Library of Congress
699	         12026:  http://www.loc.gov/xxx/naapolicy.html
700	                 foobar.zaf.org USLC
701	                 sneezy.dopey.com USLC
702	         #
703	         #       National Agriculture Library
704	         12027:  http://www.nal.gov/xxx/naapolicy.html
705	                 foobar.zaf.gov:80 USNAL
706	         #
707	         #       University of California
708	         13030:  http://ark.cdlib.org/
709	                 ark.cdlib.org CDL
710	         #
711	         #       World Intellectual Property Organization
712	         13038:  http://www.wipo.int/xxx/naapolicy.html
713	                 www.wipo.int WIPO
714	         #
715	         #--- end of data ---
716	         # The following Perl script takes an NAA as argument and outputs
717	         # the NMAs in this file listed under any matching NAA.
718	         #
719	         # my $naa = shift;
720	         # while (<>) {
721	         #       next if (! /^$naa:/);
722	         #       while (<>) {
723	         #               last if (! /^[#\s]./);
724	         #               print "$1\n" if (/^\s+(\S+)/);
725	         #       }
726	         # }
727	         #
728	         # Create a g/t/nroff-safe version of this table with the UNIX command,
729	         #
730	         #       expand natab | sed 's/\\/\\\e/g' > natab.roff
731	         #
732	         # end of file
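
   For readers who prefer Python to the Perl embedded in the file, the
   same lookup might be sketched as follows.  This is not part of the
   proposed file.  The name lookup_nmahs is hypothetical, and the sketch
   assumes that NAA lines start in the first column of the actual file,
   that NMA lines are indented, and that comment lines begin with `#'.
   The sample data repeats the made-up values shown above.

```python
def lookup_nmahs(natab_text, naan):
    """Return the NMA hostports listed under the given NAA Number."""
    nmahs = []
    in_entry = False
    for line in natab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # blank or comment line
        if not line[0].isspace():
            # An NAA line: does it open the entry being sought?
            in_entry = line.startswith(naan + ":")
        elif in_entry:
            # An indented NMA line: the hostport is the first token.
            nmahs.append(line.split()[0])
    return nmahs


SAMPLE = """\
#       National Library of Medicine
12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov USNLM
        foobar.zaf.org UCSF
        sneezy.dopey.com BIREME
#
#       Library of Congress
12026:  http://www.loc.gov/xxx/naapolicy.html
        foobar.zaf.org USLC
        sneezy.dopey.com USLC
"""

print(lookup_nmahs(SAMPLE, "12026"))
```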

734	4.2.  Looking up NMAHs Distributed via DNS

736	   This subsection introduces a method for looking up NMAHs that is
737	   based on the method for discovering URN resolvers described in
738	   [NAPTR].  It relies on querying the DNS system already installed in
739	   the background infrastructure of most networked computers.  A query
740	   is submitted to DNS asking for a list of resolvers that match a given
741	   NAAN.  DNS distributes the query to the particular DNS servers that
742	   can best provide the answer, unless the answer can be found more
743	   quickly in a local DNS cache as a side-effect of a recent query.
744	   Responses come back inside Name Authority Pointer (NAPTR) records.
745	   The normal result is one or more candidate NMAHs.

747	   In its full generality the [NAPTR] algorithm ambitiously accommodates
748	   a complex set of preferences, orderings, protocols, mapping services,
749	   regular expression rewriting rules, and DNS record types.  This
750	   subsection proposes a drastic simplification of it for the special
751	   case of ARK mapping authority discovery.  The simplified algorithm is
752	   called Maptr.  It uses only one DNS record type (NAPTR) and restricts
753	   most of its field values to constants.  The following hypothetical
754	   excerpt from a DNS data file for the NAAN known as 12026 shows three
755	   example NAPTR records ready to use with the Maptr algorithm.

757	       12026.ark.arpa.
758	       ;; US Library of Congress
759	       ;;       order pref flags service regexp  replacement
760	        IN NAPTR  0     0   "h"  "ark"   "USLC"  lhc.nlm.nih.gov:8080
761	        IN NAPTR  0     0   "h"  "ark"   "USLC"  foobar.zaf.org
762	        IN NAPTR  0     0   "h"  "ark"   "USLC"  sneezy.dopey.com

764	   All the fields are held constant for Maptr except for the "flags",
765	   "regexp", and "replacement" fields.  The "service" field contains the
766	   constant value "ark" so that NAPTR records participating in the Maptr
767	   algorithm will not be confused with other NAPTR records.  The "order"
768	   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
769	   the algorithm may evolve to use these fields for ranking decisions
770	   when usage patterns and local administrative needs are better
771	   understood.

773	   When a Maptr query returns a record with a flags field of "h" (for
774	   hostport, a Maptr extension to the NAPTR flags), the replacement
775	   field contains the NMAH (hostport) of an ARK service provider.  When
776	   a query returns a record with a flags field of "" (the empty string),
777	   the client needs to submit a new query containing the domain name
778	   found in the replacement field.  This second sort of record exploits
779	   the distributed nature of DNS by redirecting the query to another
780	   domain name.  It looks like this.

782	       12345.ark.arpa.
783	       ;; Digital Library Consortium
784	       ;;       order pref flags service regexp replacement
785	        IN NAPTR  0     0    ""  "ark"     ""   dlc.spct.org.

787	   Here is the Maptr algorithm for ARK mapping authority discovery.  In
788	   it replace <NAAN> with the NAAN from the ARK for which an NMAH is
789	   sought.

791	        (1) Initialize the DNS query:  type=NAPTR,
792	        query=<NAAN>.ark.arpa.

794	        (2) Submit the query to DNS and retrieve (NAPTR) records,
795	        discarding any record that does not have "ark" for the service
796	        field.

798	        (3) All remaining records with a flags field of "h" contain
799	        candidate NMAHs in their replacement fields.  Set them aside, if
800	        any.

802	        (4) Any record with an empty flags field ("") has a replacement
803	        field containing a new domain name to which a subsequent query
804	        should be redirected.  For each such record, set
805	        query=<replacement> then go to step (2).  When all such records
806	        have been recursively exhausted, go to step (5).

808	        (5) All redirected queries have been resolved and a set of
809	        candidate NMAHs has been accumulated from step (3).  If there
810	        are zero NMAHs, exit -- no mapping authority was found.  If
811	        there are one or more NMAHs, choose one using any criteria you
812	        wish, then exit.

814	   A Perl script that implements this algorithm is included here.

816	     #!/depot/bin/perl

818	     use Net::DNS;                 # include simple DNS package
819	     my $qtype = "NAPTR";               # initialize query type
820	     my $naa = shift;              # get NAAN script argument
821	     my $mad = new Net::DNS::Resolver;  # mapping authority discovery

823	     &maptr("$naa.ark.arpa");      # call maptr - that's it

825	     sub maptr {                   # recursive maptr algorithm
826	          my $dname = shift;       # domain name as argument
827	          my ($rr, $order, $pref, $flags, $service, $regexp,
828	               $replacement);
829	          my $query = $mad->query($dname, $qtype);
830	          return                   # non-productive query
831	               if (! $query || ! $query->answer);
832	          foreach $rr ($query->answer) {
833	               next           # skip records of wrong type
834	                    if ($rr->type ne $qtype);
835	               ($order, $pref, $flags, $service, $regexp,
836	                    $replacement) = split(/\s/, $rr->rdatastr);
837	               if ($flags eq "") {
838	                    &maptr($replacement);    # recurse
839	               } elsif ($flags eq "h") {
840	                    print "$replacement\n";  # candidate NMAH
841	               }
842	          }
843	     }
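
   The control flow of steps (1) through (5) can also be exercised
   without a live DNS.  In the Python sketch below, which is illustrative
   only, the DNS query is passed in as a function returning (flags,
   service, replacement) tuples.  The names maptr, lookup, and ZONE are
   hypothetical, the records under "dlc.spct.org." are invented for the
   example, and the guard against revisiting a domain name is an
   addition that the algorithm above does not mention.

```python
def maptr(naan, lookup):
    """Collect candidate NMAHs for a NAAN, following redirections."""
    candidates = []
    seen = set()
    pending = [naan + ".ark.arpa."]           # step (1)
    while pending:
        name = pending.pop(0)
        if name in seen:                      # added guard against loops
            continue
        seen.add(name)
        for flags, service, replacement in lookup(name):   # step (2)
            if service != "ark":
                continue                      # discard non-ARK records
            if flags == "h":
                candidates.append(replacement)    # step (3)
            elif flags == "":
                pending.append(replacement)       # step (4)
    return candidates                         # step (5): caller chooses


# A stand-in for DNS, keyed by domain name.
ZONE = {
    "12026.ark.arpa.": [
        ("h", "ark", "lhc.nlm.nih.gov:8080"),
        ("h", "ark", "foobar.zaf.org"),
        ("h", "ark", "sneezy.dopey.com"),
    ],
    "12345.ark.arpa.": [("", "ark", "dlc.spct.org.")],
    "dlc.spct.org.": [
        ("h", "ark", "foobar.zaf.org"),
        ("h", "other", "ignored.example"),
    ],
}

print(maptr("12345", lambda name: ZONE.get(name, [])))
```

   An empty result corresponds to the case in step (5) where no mapping
   authority was found.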

845	   The global database thus distributed via DNS and the Maptr algorithm
846	   can easily be seen to mirror the contents of the Name Authority Table
847	   file described in the previous section.

849	5.  Generic ARK Service Definition

851	   An ARK request's output is delivered information; examples include
852	   the object itself, a policy declaration (e.g., a promise of support),
853	   a descriptive metadata record, or an error message.  ARK services
854	   must be couched in high-level, protocol-independent terms if
855	   persistence is to outlive today's networking infrastructural
856	   assumptions.  The high-level ARK service definitions listed below are
857	   followed in the next section by a concrete method (one of many
858	   possible methods) for delivering these services with today's
859	   technology.

861	5.1.  Generic ARK Access Service (access, location)

863	   Returns (a copy of) the object or a redirect to the same, although a
864	   sensible object proxy may be substituted.  Examples of sensible
865	   substitutes include,
866	     - a table of contents instead of a large complex document,
867	     - a home page instead of an entire web site hierarchy,
868	     - a rights clearance challenge before accessing protected data,
869	     - directions for access to an offline object (e.g., a book),
870	     - a description of an intangible object (a disease, an event), or
871	     - an applet acting as "player" for a large multimedia object.

873	   May also return a discriminated list of alternate object locators.
874	   If access is denied, returns an explanation of the object's current
875	   (perhaps permanent) inaccessibility.

877	5.2.  Generic Policy Service (permanence, naming, etc.)

879	   Returns declarations of policy and support commitments for given
880	   ARKs.  Declarations are returned in either a structured metadata
881	   format or a human readable text format; sometimes one format may
882	   serve both purposes.  Policy subareas may be addressed in separate
883	   requests, but the following areas should be covered:  object
884	   permanence, object naming, object fragment addressing, and
885	   operational service support.

887	   The permanence declaration for an object is a rating defined with
888	   respect to an identified permanence provider (guarantor), and may
889	   include the following aspects.  One permanence rating framework is
890	   given in [NLMPerm].

892	        (a) "object availability" -- whether and how access to the
893	        object is supported (e.g., online 24x7, or offline only),

895	        (b) "identifier validity" -- under what conditions the
896	        identifier will be or has been re-assigned,

898	        (c) "content invariance" -- under what conditions the content of
899	        the object is subject to change, and

901	        (d) "change history" -- documentation, whether abbreviated or
902	        detailed, of any or all corrections, migrations, revisions, etc.

904	   Naming policy for an object includes an historical description of the
905	   NAA's (and its successor NAA's) policies regarding differentiation of
906	   objects.  It may include the following aspects.

908	        (e) "similarity" -- (or "unity") the limit, defined by the NAA,
909	        to the level of dissimilarity beyond which two similar objects
910	        warrant separate identifiers but before which they share one
911	        single identifier, and

913	        (f) "granularity" -- the limit, defined by the NAA, to the level
914	        of object subdivision beyond which sub-objects do not warrant
915	        separately assigned identifiers but before which sub-objects are
916	        assigned separate identifiers.

918	   Addressing policy for an object includes a description of how, during
919	   access, object components (e.g., paragraphs, sections) or views
920	   (e.g., image conversions) may or may not be "addressed", in other
921	   words, how the NMA permits arguments or parameters to modify the
922	   object delivered as the result of an ARK request.  If supported,
923	   these sorts of operations would provide things like byte-ranged
924	   fragment delivery and open-ended format conversions, or any set of
925	   possible transformations that would be too numerous to list or to
926	   identify with separately assigned ARKs.

928	   Operational service support policy includes a description of general
929	   operational aspects of the NMA service, such as after-hours staffing
930	   and trouble reporting procedures.

932	5.3.  Generic Description Service

934	   Returns a description of the object.  Descriptions are returned in
935	   either a structured metadata format or a human readable text format;
936	   sometimes one format may serve both purposes.  A description must at
937	   a minimum answer the who, what, when, and where questions concerning
938	   an expression of the object.  Standalone descriptions should be
939	   accompanied by the modification date and source of the description
940	   itself.  May also return discriminated lists of ARKs that are related
941	   to the given ARK.

943	6.  Overview of the HTTP Key Mapping Protocol (HKMP)

945	   The HTTP Key Mapping Protocol (HKMP) is a way of taking a key (a kind
946	   of identifier) and asking such questions as, what information does
947	   this identify and how permanent is it?  [HKMP] is in fact one
948	   specific method under development for delivering ARK services.  The
949	   protocol runs over HTTP to exploit the web browser's current pre-
950	   eminence as user interface to the Internet.  HKMP is designed so that
951	   a person can enter ARK requests directly into the location field of
952	   current browser interfaces.  Because it runs over HTTP, HKMP can be
953	   simulated and tested within keyboard-based [TELNET] sessions.

955	   The asker (a person or client program) starts with an identifier,
956	   such as an ARK or a URL.  The identifier reveals to the asker (or
957	   allows the asker to infer) the Internet host name and port number of
958	   a server system that responds to questions.  Here, this is just the
959	   NMAH that is obtained by inspection and possibly lookup based on the
960	   ARK's NAAN.  The asker then sets up an HTTP session with the server
961	   system, sends a question via an HKMP request (contained within an
962	   HTTP request), receives an answer via an HKMP response (contained
963	   within an HTTP response), and closes the session.  That concludes the
964	   connected portion of the protocol.

966	   An HKMP request is a string of characters beginning with a `?'
967	   (question mark) that is appended to the identifier string.  The
968	   resulting string is sent as an argument to HTTP's GET command.
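
   As a small illustration, the string handed to GET might be assembled
   as in the Python sketch below.  The function name hkmp_request_url is
   hypothetical, and the "http://NMAH/ARK" layout is an assumption.  An
   empty request leaves the identifier unchanged.

```python
def hkmp_request_url(nmah, ark, request=""):
    """Append an HKMP request string to an ARK served by the given NMAH."""
    if request and not request.startswith("?"):
        raise ValueError("an HKMP request begins with '?'")
    return "http://" + nmah + "/" + ark + request


print(hkmp_request_url("ark.nlm.nih.gov", "ark:/12025/psbbantu", "?"))
```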

970	   Request strings too long for GET may be sent using HTTP's POST
971	   command.  The three most common requests correspond to three
972	   degenerate special cases that keep the user's learning and typing
973	   burden low.  First, a simple key with no request at all is the same
974	   as an ordinary access request.  Thus a plain ARK entered into a
975	   browser's location field behaves much like a plain URL, and returns
976	   access to the primary identified object, for instance, an HTML
977	   document.

979	   The second special case is a minimal ARK description request string
980	   consisting of just "?".  For example, entering the string,

982	             ark.nlm.nih.gov/12025/psbbantu?

984	   into the browser's location field directly precipitates a request for
985	   a metadata record describing the object identified by
986	   ark:/12025/psbbantu.  The browser, unaware of HKMP, prepares and sends
987	   an HTTP GET request in the same manner as for a URL.  HKMP is designed
988	   so that the response (indicated by the returned HTTP content type) is
989	   normally displayed, whether the output is structured for machine
990	   processing (text/plain) or formatted for human consumption (text/html).

992	   In the following example HKMP session, each line has been annotated
993	   to include a line number and whether it was the client or server that
994	   sent it.  Without going into much depth, the session has three pieces
995	   separated from each other by blank lines:  the client's piece (lines
996	   1-3), the server's HTTP/HKMP response headers (4-7), and the two parts
997	   of the body of the server's response (8-11 and 12-17).  The first and
998	   last lines (1 and 17) correspond to the client's steps to start the
999	   TCP session and the server's steps to end it, respectively.

1001	      1  C: [opens session]
1002	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
1003	         C:
1004	         S: HTTP/1.1 200 OK
1005	      5  S: Content-Type: text/plain
1006	         S: HKMP-Status: 0.1 200 OK
1007	         S:
1008	         S: |set: NLM | 12025/psbbantu? | 2003 07 31
1009	         S:         http://ark.nlm.nih.gov/ark:/12025/psbbantu?
1010	     10  S: here: 1 | 1 | 1
1011	         S:
1012	         S: erc:
1013	         S: who:    Lederberg, Joshua
1014	         S: what:   Studies of Human Families for Genetic Linkage
1015	     15  S: when:   1974
1016	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1017	         S: [closes session]

1019	   The first two server response lines (4-5) above are typical of HTTP.
1020	   The next line (6) is peculiar to HKMP, and indicates the HKMP version
1021	   and a normal return status.  The balance of the response consists of
1022	   a record set header (lines 8-10) and a single metadata record (12-16)
1023	   that comprises the ARK description service response.  The record set
1024	   header identifies (8-9) who created the set, what its title is, when
1025	   it was created, and where an automated process can access the set,
1026	   and ends in a line (10) indicating that here in this communication
1027	   the recipient can expect to find records numbered 1 to 1 of a total
1028	   of 1 record in the set (i.e., here is the entire set, consisting of
1029	   exactly one record).
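
   As a rough illustration of how a recipient might read the last line
   of the record set header, the following Python sketch splits a
   "here:" line into its three numbers.  The function name parse_here
   is hypothetical; the reading of the three fields as first record,
   last record, and total follows the description above.

```python
def parse_here(line):
    """Parse a record set "here:" line into (first, last, total)."""
    label, _, value = line.partition(":")
    if label.strip() != "here":
        raise ValueError("not a here: line: %r" % line)
    # The three numbers are separated by `|' (pipe) characters.
    first, last, total = (int(part) for part in value.split("|"))
    return first, last, total


first, last, total = parse_here("here: 1 | 1 | 1")
# The entire set is present when records 1 through total are all here.
complete = (first == 1 and last == total)
```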

1031	   The returned record (12-16) is in the format of an Electronic
1032	   Resource Citation [ERC], which is discussed in more detail in the
1033	   next section.  For now, note that it contains four elements that
1034	   answer the top priority questions regarding an expression of the
1035	   object:  who played a major role in expressing it, what the expression
1036	   was called, when it was created, and where the expression may be found.
1037	   This quartet of elements comes up again and again in ERCs.

1039	   The third degenerate special case of an ARK request (and no other
1040	   cases will be described in this document) is the string "??",
1041	   corresponding to a minimal permanence policy request.  It can be seen
1042	   in use appended to an ARK (on line 2) in the example session that
1043	   follows.

1045	      1  C: [opens session]
1046	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
1047	         C:
1048	         S: HTTP/1.1 200 OK
1049	      5  S: Content-Type: text/plain
1050	         S: HKMP-Status: 0.1 200 OK
1051	         S:
1052	         S: |set: NLM | 12025/psbbantu?? | 2003 07 31
1053	         S:         http://ark.nlm.nih.gov/ark:/12025/psbbantu??
1054	     10  S: here: 1 | 1 | 1
1055	         S:
1056	         S: erc:
1057	         S: who:    Lederberg, Joshua
1058	         S: what:   Studies of Human Families for Genetic Linkage
1059	     15  S: when:   1974
1060	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1061	         S: erc-support:
1062	         S: who:    USNLM
1063	         S: what:   Permanent, Unchanging Content
1064	     20  S: when:   2001 04 21
1065	         S: where:  http://ark.nlm.nih.gov/yy22948
1066	         S: [closes session]

1068	   Again, a single metadata record (lines 12-21) is returned, but it
1069	   consists of two segments.  The first segment (12-16) gives the same
1070	   basic citation information as in the previous example.  It is
1071	   returned in order to establish context for the persistence
1072	   declaration in the second segment (17-21).

1074	   Each segment in an ERC tells a different story relating to the
1075	   object, so although the same four questions (elements) appear in
1076	   each, the answers depend on the segment's story type.  While the
1077	   first segment tells the story of an expression of the object, the
1078	   second segment tells the story of the support commitment made to it:
1079	   who made the commitment, what the nature of the commitment was, when
1080	   it was made, and where a fuller explanation of the commitment may be
1081	   found.
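
   The three degenerate requests differ only in the suffix appended to
   the ARK.  A minimal Python sketch of how a client might build the
   corresponding URLs follows; the function name hkmp_request_url and
   its "kind" argument are illustrative assumptions and not part of
   HKMP.

```python
def hkmp_request_url(host, ark, kind="access"):
    """Return the URL for one of the three degenerate HKMP requests."""
    suffixes = {
        "access": "",        # plain ARK: ordinary access to the object
        "description": "?",  # minimal description request
        "policy": "??",      # minimal permanence policy request
    }
    if kind not in suffixes:
        raise ValueError("unknown request kind: %r" % kind)
    return "http://%s/%s%s" % (host, ark, suffixes[kind])


description_url = hkmp_request_url(
    "ark.nlm.nih.gov", "ark:/12025/psbbantu", "description")
policy_url = hkmp_request_url(
    "ark.nlm.nih.gov", "ark:/12025/psbbantu", "policy")
```

   Some URL-handling libraries may drop an empty query string, so a
   real client may need to take care that the trailing "?" or "??"
   survives until the request is sent.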

1083	7.  Overview of Electronic Resource Citations (ERCs)

1085	   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
1086	   simple, compact, and printable record designed to hold data
1087	   associated with an information resource.  By design, the ERC is a
1088	   metadata format that balances the needs for expressive power, very
1089	   simple machine processing, and direct human manipulation.

1091	   A founding principle of the ERC is that direct human contact with
1092	   metadata will be a necessary and sufficient condition for the near
1093	   term rapid development of metadata standards, systems, and services.
1094	   Thus the machine-processable ERC format must only minimally strain
1095	   people's ability to read, understand, change, and transmit ERCs
1096	   without their relying on intermediation with specialized software
1097	   tools.  The basic ERC needs to be succinct, transparent, and
1098	   trivially parseable by software.

1100	   In the current Internet, it is natural to consider XML seriously as an
1101	   exchange format because of predictions that it will obviate
1102	   many ad hoc formats and programs, and unify much of the world's
1103	   information under one reliable data structuring discipline that is
1104	   easy to generate, verify, parse, and render.  It appears, however,
1105	   that XML is still only catching on after years of standards work and
1106	   implementation experience.  The reasons for it are unclear, but for
1107	   now very simple XML interpretation is still out of reach.  Another
1108	   important caution is that XML structures are hard on the eyeballs,
1109	   taking up an amount of display (and page) space that significantly
1110	   exceeds that of traditional formats.  Until these conflicts with ERC
1111	   principle are resolved, XML is not a first choice for representing
1112	   ERCs.  Borrowing instead from the data structuring format that
1113	   underlies the successful spread of email and web services, the first
1114	   ERC format is based on email and HTTP headers (RFC822) [EMHDRS].
1115	   There is a naturalness to its label-colon-value format (seen in the
1116	   previous section) that barely needs explanation to a person beginning
1117	   to enter ERC metadata.

1119	   Besides simplicity of ERC system implementation and data entry
1120	   mechanics, ERC semantics (what the record and its constituent parts
1121	   mean) must also be easy to explain.  ERC semantics are based on a
1122	   reformulation and extension of the Dublin Core [DCORE] hypothesis,
1123	   which suggests that the fifteen Dublin Core metadata elements have a
1124	   key role to play in cross-domain resource description.  The ERC
1125	   design recognizes that the Dublin Core's primary contribution is the
1126	   international, interdisciplinary consensus that identified fifteen
1127	   semantic buckets (element categories), regardless of how they are
1128	   labeled.  The ERC then adds a definition for a record and some
1129	   minimal compliance rules.  In pursuing the limits of simplicity, the
1130	   ERC design combines and relabels some Dublin Core buckets to isolate
1131	   a tiny kernel (subset) of four elements for basic cross-domain
1132	   resource description.

1134	   For the cross-domain kernel, the ERC uses the four basic elements --
1135	   who, what, when, and where -- to pretend that every object in the
1136	   universe can have a uniform minimal description.  Each has a name or
1137	   other identifier, a location, some responsible person or party, and a
1138	   date.  It doesn't matter what type of object it is, or whether one
1139	   plans to read it, interact with it, smoke it, wear it, or navigate
1140	   it.  Of course, this approach is flawed because uniformity of
1141	   description for some object types requires more semantic contortion
1142	   and sacrifice than for others.  That is why at the beginning of this
1143	   document, the ARK was said to be suited to objects that accommodate
1144	   reasonably regular electronic description.

1146	   While insisting on uniformity at the most basic level provides
1147	   powerful cross-domain leverage, the semantic sacrifice is great for
1148	   many applications.  So the ERC also permits a semantically rich and
1149	   nuanced description to co-exist in a record along with a basic
1150	   description.  In that way both sophisticated and naive recipients of
1151	   the record can extract the level of meaning from it that best suits
1152	   their needs and abilities.  Key to unlocking the richer description
1153	   is a controlled vocabulary of ERC record types (not explained in this
1154	   document) that permit knowledgeable recipients to apply defined sets
1155	   of additional assumptions to the record.

1157	7.1.  ERC Syntax

1159	   An ERC record is a sequence of metadata elements ending in a blank
1160	   line.  An element consists of a label, a colon, and an optional
1161	   value.  Here is an example of a record with five elements.

1163	          erc:
1164	          who: Gibbon, Edward
1165	          what: The Decline and Fall of the Roman Empire
1166	          when: 1781
1167	          where: http://www.ccel.org/g/gibbon/decline/

1169	   A long value may be folded (continued) onto the next line by
1170	   inserting a newline and indenting the next line.  A value can be thus
1171	   folded across multiple lines.  Here are two example elements, each
1172	   folded across four lines.

1174	          who/created: University of California, San Francisco, AIDS
1175	               Program at San Francisco General Hospital | University
1176	               of California, San Francisco, Center for AIDS Prevention
1177	               Studies
1178	          what/Topic:
1179	                Heart Attack | Heart Failure
1180	               | Heart
1181	                                Diseases

1183	   An element value folded across several lines is treated as if the
1184	   lines were joined together on one long line.  For example, the second
1185	   element from the previous example is considered equivalent to

1187	          what/Topic: Heart Attack | Heart Failure | Heart Diseases

1189	   An element value may contain multiple values, each one separated from
1190	   the next by a `|' (pipe) character.  The element from the previous
1191	   example contains three values.

1193	   For annotation purposes, any line beginning with a `#' (hash)
1194	   character is treated as if it were not present; this is a "comment"
1195	   line (a feature not available in email or HTTP headers).  For example,
1196	   the following element is spread across four lines and contains two
1197	   values:

1199	          what/Topic:
1200	               Heart Attack
1201	          #    | Heart Failure  -- hold off until next review cycle
1202	               | Heart Diseases
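
   The syntax rules above (label-colon-value, folding, `|' separated
   values, and comment lines) can be sketched in a few lines of Python.
   This is only an illustration under stated assumptions:  the name
   parse_erc_elements is hypothetical, the input is assumed to hold one
   record whose element lines start in the first column once common
   indentation is removed, and the encoding mechanisms described later
   are not handled.

```python
import textwrap


def parse_erc_elements(text):
    """Return a list of (label, values) pairs for the first record in text."""
    elements = []  # each entry is [label, raw value string]
    for line in textwrap.dedent(text).splitlines():
        if line.startswith("#"):
            continue  # comment lines are treated as if not present
        if not line.strip():
            if elements:
                break  # a blank line ends the record
            continue  # ignore blank lines before the record starts
        if line[0] in " \t" and elements:
            # An indented line continues (folds into) the previous value.
            elements[-1][1] += " " + line.strip()
        else:
            label, _, value = line.partition(":")
            elements.append([label.strip(), value.strip()])
    result = []
    for label, raw in elements:
        values = [v.strip() for v in raw.split("|")] if raw.strip() else []
        result.append((label, values))
    return result


record = """\
what/Topic:
     Heart Attack
#    | Heart Failure  -- hold off until next review cycle
     | Heart Diseases
"""
parsed = parse_erc_elements(record)
```

   Here "parsed" holds one element, "what/Topic", with the two values
   "Heart Attack" and "Heart Diseases"; the commented-out value is
   skipped.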

1204	7.2.  ERC Stories

1206	   An ERC record is organized into one or more distinct segments, where
1207	   each segment tells a story about a different aspect of the
1208	   information resource.  A segment boundary occurs whenever a segment
1209	   label (an element beginning with "erc") is encountered.  The basic
1210	   label "erc:" introduces the story of an object's expression (e.g.,
1211	   its publication, installation, or performance).  The label
1212	   "erc-about:" introduces the story of an object's content (what it is
1213	   about) and "erc-support:" introduces the story of a support
1214	   commitment made to it.  A story segment that concerns the ERC itself
1215	   is introduced by the label "erc-from:".  It is an important segment
1216	   that tells the story of the ERC's provenance.  Elements beginning
1217	   with "erc" are reserved for segment labels and their associated story
1218	   types.  From an earlier example, here is an ERC with two segments.

1220	         erc:
1221	         who:    Lederberg, Joshua
1222	         what:   Studies of Human Families for Genetic Linkage
1223	         when:   1974
1224	         where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1225	         erc-support:
1226	         who:    NIH/NLM/LHNCBC
1227	         what:   Permanent, Unchanging Content
1228	         # Note to ops staff:  date needs verification.
1229	         when:   2001 04 21
1230	         where:  http://ark.nlm.nih.gov/yy22948

1232	   Segment stories are told according to journalistic tradition.  While
1233	   any number of pertinent elements may appear in a segment, priority is
1234	   placed on answering the questions who, what, when, and where at the
1235	   beginning of each segment so that readers can make the most important
1236	   selection or rejection decisions as soon as possible.  To make things
1237	   simple, the listed ordering of the questions is maintained in each
1238	   segment (as it happens, most people who have been exposed to this
1239	   storytelling technique are already familiar with the above
1240	   ordering).

1242	   The four questions are answered by using corresponding element
1243	   labels.  The four element labels can be re-used in each story
1244	   segment, but their meaning changes depending on the segment (the story
1245	   type) in which they appear.  In the example above, "who" is first
1246	   used to name a document's author and subsequently used to name the
1247	   permanence guarantor (provider).  Similarly, "when" first lists the
1248	   date of object creation and in the next segment lists the date of a
1249	   commitment decision.  Four labels appearing across three segments
1250	   effectively map to twelve semantically distinct elements.  Distinct
1251	   element meanings are mapped to Dublin Core elements in a later
1252	   section.
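
   To make the segment rule concrete, here is a small Python sketch
   that groups already-parsed elements into story segments, so that
   the same label can be looked up per segment.  The name
   split_segments is hypothetical, and rejecting elements that precede
   the first segment label is a choice made for this sketch only.

```python
def split_segments(elements):
    """Group (label, value) pairs into [(segment_label, [pairs...]), ...]."""
    segments = []
    for label, value in elements:
        if label.startswith("erc"):
            segments.append((label, []))  # segment labels begin with "erc"
        elif segments:
            segments[-1][1].append((label, value))
        else:
            raise ValueError("element %r precedes any segment label" % label)
    return segments


record = [
    ("erc", ""), ("who", "Lederberg, Joshua"), ("when", "1974"),
    ("erc-support", ""), ("who", "NIH/NLM/LHNCBC"), ("when", "2001 04 21"),
]
segments = split_segments(record)
# The same label carries a different meaning in each story segment.
meanings = {(seg, label): value
            for seg, pairs in segments for label, value in pairs}
```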

1254	7.3.  The ERC Anchoring Story

1256	   Each ERC contains an anchoring story.  It is usually the first
1257	   segment labeled "erc:" and it concerns an "anchoring" expression of
1258	   the object.  An "anchoring" expression is the one that a provider
1259	   deemed the most suitable basic referent given the audience and
1260	   application for which it produced the ERC.  If it sounds like the
1261	   provider has great latitude in choosing its anchoring expression, it
1262	   is because it does.  A typical anchoring story in an ERC for a born-
1263	   digital document would be the story of the document's release on a
1264	   web site; such a document would then be the anchoring expression.

1266	   An anchoring story need not be the central descriptive goal of an ERC
1267	   record.  For example, a museum provider may create an ERC for a
1268	   digitized photograph of a painting but choose to anchor it in the
1269	   story of the original painting instead of the story of the electronic
1270	   likeness; although the ERC may through other segments prove to be
1271	   centrally concerned with describing the electronic likeness, the
1272	   provider may have chosen this particular anchoring story in order to
1273	   make the ERC visible in a way that is most natural to patrons (who
1274	   would find the Mona Lisa under da Vinci sooner than they would find
1275	   it under the name of the person who snapped the photograph or scanned
1276	   the image).  In another example, a provider that creates an ERC for a
1277	   dramatic play as an abstract work has the task of describing a piece
1278	   of intangible intellectual property.  To anchor this abstract object
1279	   in the concrete world, if only through a derivative expression, it
1280	   makes sense for the provider to choose a suitable printed edition of
1281	   the play as the anchoring object expression (to describe in the
1282	   anchoring story) of the ERC.

1284	   The anchoring story has special rules designed to keep ERC processing
1285	   simple and predictable.  Each of the four basic elements (who, what,
1286	   when, and where) must be present, unless a best effort to supply it
1287	   fails.  In the event of failure, the element still appears but a
1288	   special value (described later) is used to explain the missing value.
1289	   While the requirement that each of the four elements be present only
1290	   applies to the anchoring story segment, as usual these elements
1291	   appear at the beginning of the segment and may only be used in the
1292	   prescribed order.  A minimal ERC would normally consist of just an
1293	   anchoring story and the element quartet, as illustrated in the next
1294	   example.

1296	         erc:
1297	         who:   National Research Council
1298	         what:  The Digital Dilemma
1299	         when:  2000
1300	         where: http://books.nap.edu/html/digital%5Fdilemma
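
   A recipient could check the anchoring story rules along the
   following lines.  This Python sketch is an assumption-laden
   illustration:  the name check_anchoring_story is hypothetical, and
   it only tests that the four elements lead the segment in order and
   that none has an empty value.

```python
KERNEL = ("who", "what", "when", "where")


def check_anchoring_story(pairs):
    """Return a list of problems found in an anchoring segment's elements."""
    problems = []
    labels = tuple(label for label, _ in pairs[:4])
    if labels != KERNEL:
        problems.append("first four elements must be who, what, when, where")
    for label, value in pairs[:4]:
        if not value.strip():
            # Even a failed best effort must leave an explanatory value.
            problems.append("%s has no value" % label)
    return problems


minimal = [
    ("who", "National Research Council"),
    ("what", "The Digital Dilemma"),
    ("when", "2000"),
    ("where", "http://books.nap.edu/html/digital%5Fdilemma"),
]
problems = check_anchoring_story(minimal)
```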

1302	   A minimal ERC can be abbreviated so that it resembles a traditional
1303	   compact bibliographic citation that is nonetheless completely machine
1304	   processable.  The required elements and ordering make it possible to
1305	   eliminate the element labels, as shown here.

1307	         erc: National Research Council | The Digital Dilemma | 2000
1308	                | http://books.nap.edu/html/digital%5Fdilemma
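
   Because the order is fixed, the abbreviated form can be expanded
   mechanically.  The Python sketch below assumes the value of the
   "erc:" element has already been unfolded onto one line; the name
   expand_abbreviated_erc is hypothetical.

```python
def expand_abbreviated_erc(value):
    """Expand the value of an abbreviated "erc:" element into labeled pairs."""
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != 4:
        raise ValueError("expected 4 values, found %d" % len(parts))
    return list(zip(("who", "what", "when", "where"), parts))


pairs = expand_abbreviated_erc(
    "National Research Council | The Digital Dilemma | 2000 "
    "| http://books.nap.edu/html/digital%5Fdilemma")
```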

1310	7.4.  ERC Elements

1312	   As mentioned, the four basic ERC elements (who, what, when, and
1313	   where) take on different specific meanings depending on the story
1314	   segment in which they are used.  By appearing in each segment, albeit
1315	   in different guises, the four elements serve as a valuable mnemonic
1316	   device -- a kind of checklist -- for constructing minimal story
1317	   segments from scratch.  Again, it is only in the anchoring segment
1318	   that all four elements are mandatory.

1320	   Here are some mappings between ERC elements and Dublin Core [DCORE]
1321	   elements.

1323	          Segment     ERC Element     Equivalent Dublin Core Element
1324	         ---------    -----------     ------------------------------
1325	            erc          who          Creator/Contributor/Publisher
1326	            erc          what                Title
1327	            erc          when                Date
1328	            erc          where               Identifier
1329	         erc-about       who                  <none>
1330	         erc-about       what                Subject
1331	         erc-about       when                Coverage (temporal)
1332	         erc-about       where               Coverage (spatial)

1334	   The basic element labels may also be qualified to add nuances to the
1335	   semantic categories that they identify.  Elements are qualified by
1336	   appending a `/' (slash) and a qualifier term.  Often qualifier terms
1337	   appear as the past tense form of a verb because it makes re-using
1338	   qualifiers among elements easier.

1340	         who/published:  ...
1341	         when/published: ...
1342	         where/published: ...

1344	   Using past tense verbs for qualifiers also reminds providers and
1345	   recipients that element values contain transient assertions that may
1346	   have been true once, but that tend to become less true over time.
1347	   Recipients that don't understand the meaning of a qualifier can fall
1348	   back onto the semantic category (bucket) designated by the unqualified
1349	   element label.  Inevitably recipients (people and software) will
1350	   have diverse abilities in understanding elements and qualifiers.

1352	   Any number of other elements and qualifiers may be used in conjunction
1353	   with the quartet of basic segment questions.  The only semantic
1354	   requirement is that they pertain to the segment's story.  Also, it is
1355	   only the four basic elements that change meaning depending on their
1356	   segment context.  All other elements have meaning independent of the
1357	   segment in which they appear.  If an element label stripped of its
1358	   qualifier is still not recognized by the recipient, a second fall
1359	   back position is to ignore it and rely on the four basic elements.
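
   The two fall back positions can be sketched as a simple lookup.  In
   this Python illustration the name resolve_label and the "known" set
   (the labels a particular recipient understands) are hypothetical.

```python
def resolve_label(label, known):
    """Return the most specific form of label that is understood, or None."""
    if label in known:
        return label  # the qualified label is understood as given
    base = label.split("/", 1)[0]
    if base in known:
        return base  # first fall back: the unqualified bucket
    return None  # second fall back: ignore the element


known = {"who", "what", "when", "where", "who/published"}
resolved = [resolve_label(label, known)
            for label in ("who/published", "when/Reviewed", "IDcode")]
```

   Here "resolved" is ["who/published", "when", None]:  the first label
   is understood as is, the second falls back to its bucket, and the
   third is ignored.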

1361	   Elements may be either Canonical, Provisional, or Local.  Canonical
1362	   elements are officially recognized via a registry as part of the
1363	   metadata vernacular.  All elements, qualifiers, and segment labels
1364	   used in this document up until now belong to that vernacular.
1365	   Provisional elements are also officially recognized via the registry,
1366	   but have only been proposed for inclusion in the vernacular.  To be
1367	   promoted to the vernacular, a provisional element passes through a
1368	   vetting process during which its documentation must be in order and its
1369	   community acceptance demonstrated.  Local elements are any elements
1370	   not officially recognized in the registry.  The registry [REG] is a
1371	   work in progress.

1373	   Local elements are immediately distinguishable from Canonical or
1374	   Provisional elements because all terms that begin with an upper case
1375	   letter are reserved for spontaneous local use.  No term beginning
1376	   with an upper case letter will ever be assigned Canonical or
1377	   Provisional status, so it should be safe to use such terms for local
1378	   purposes.  Any recipient of external ERCs containing such terms will
1379	   understand them to be part of the originating provider's local
1380	   metadata dialect.  Here's an example ERC with three segments, one local
1381	   element, and two local qualifiers.  The segment boundaries have been
1382	   emphasized by comment lines (which, as before, are ignored by
1383	   processors).

1385	         erc:
1386	         who: Bullock, TH | Achimowicz, JZ | Duckrow, RB
1387	                 | Spencer, SS | Iragui-Madoz, VJ
1388	         what: Bicoherence of intracranial EEG in sleep,
1389	                 wakefulness and seizures
1390	         when: 1997 12 00
1391	         where: http://cogprints.soton.ac.uk/%{
1392	                 documents/disk0/00/00/01/22/index.html %}
1393	         in: EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
1394	         IDcode: cog00000122
1395	         # ---- new segment ----
1396	         erc-about:
1397	         what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
1398	                 | Cooperativity | Subdural | Hippocampus | Higher moment
1399	         # ---- new segment ----
1400	         erc-from:
1401	         who: NIH/NLM/NCBI
1402	         what: pm9546494
1403	         when/Reviewed: 1998 04 18 021600
1404	         where: http://ark.nlm.nih.gov/12025/pm9546494?

1406	   The local element "IDcode" immediately precedes the "erc-about"
1407	   segment, which itself contains an element with the local qualifier
1408	   "Subcategory".  The second to last element also carries the local
1409	   qualifier "Reviewed".  Finally, what might be a provisional element
1410	   "in" appears near the end of the first segment.  It might have been
1411	   proposed as a way to complete a citation for an object originally
1412	   appearing inside another object (such as an article appearing in a
1413	   journal or an encyclopedia).
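
   The upper case convention makes local terms easy to detect without
   consulting the registry.  A minimal Python sketch, with hypothetical
   function names, follows.

```python
def is_local_term(term):
    """True if term begins with an upper case letter (reserved for local use)."""
    return term[:1].isupper()


def local_terms(label):
    """Return the local terms (element and/or qualifier) found in a label."""
    base, _, qualifier = label.partition("/")
    return [term for term in (base, qualifier) if is_local_term(term)]


found = {label: local_terms(label)
         for label in ("IDcode", "what/Subcategory", "when/Reviewed", "in")}
```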

1415	7.5.  ERC Element Values

1417	   ERC element values tend to be straightforward strings.  If the
1418	   provider intends something special for an element, it will so
1419	   indicate with markers at the beginning of its value string.  The
1420	   markers are designed to be uncommon enough that they would not likely
1421	   occur in normal data except by deliberate intent.  Markers can only
1422	   occur near the beginning of a string, and once any octet of non-
1423	   marker data has been encountered, no further marker processing is
1424	   done for the element value.  In the absence of markers the string is
1425	   considered pure data; this has been the case with all the examples
1426	   seen thus far.  The fullest form of an element value with all three
1427	   optional markers in place looks like this.

1429	         VALUE =    [markup_flags]    (:ccode)    ,    DATA

1431	   In processing, the first non-whitespace character of an ERC element
1432	   value is examined.  An initial `[' is reserved to introduce a bracketed
1433	   set of markup flags (not described in this document) that ends
1434	   with `]'.  If ERC data is machine-generated, each value string may be
1435	   preceded by "[]" to prevent any of its data from being mistaken for
1436	   markup flags.  Once past the optional markup, the remaining value may
1437	   optionally begin with a controlled code.  A controlled code always
1438	   has the form "(:ccode)", for example,

1440	         who: (:unkn) Anonymous
1441	         what: (:791) Bee Stings

1443	   Any string after such a code is taken to be an uncontrolled (e.g.,
1444	   natural language) equivalent.  The code "unkn" indicates a
1445	   conventional explanation for a missing value (stating that the value is
1446	   unknown).  The remainder of the string makes an equivalent statement
1447	   in a form that the provider deemed most suitable to its (probably
1448	   human) audience.  The code "791" could be a fixed numeric topic
1449	   identifier within an unspecified topic vocabulary.  Any code may be
1450	   ignored by those that do not understand it.

1452	   There are several codes to explain different ways in which a required
1453	   element's value may go missing.

1455	         (:unkn)   unknown (e.g., Anonymous, Inconnue)
1456	         (:unav)   value unavailable indefinitely
1457	         (:unac)   temporarily inaccessible
1458	         (:unap)   not applicable, makes no sense
1459	         (:unas)   value unassigned (e.g., Untitled)
1460	         (:none)   never had a value, never will
1461	         (:null)   explicitly empty
1462	         (:unal)   unallowed, suppressed intentionally
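
   As a hedged illustration of the first two markers, the Python sketch
   below separates optional markup flags and an optional controlled
   code from the rest of a value, and recognizes the missing-value
   codes listed above.  The names split_markers and is_missing are
   hypothetical, the flags are returned uninterpreted, and the
   remaining text is returned untouched.

```python
import re

MISSING_CODES = {
    "unkn": "unknown",
    "unav": "value unavailable indefinitely",
    "unac": "temporarily inaccessible",
    "unap": "not applicable, makes no sense",
    "unas": "value unassigned",
    "none": "never had a value, never will",
    "null": "explicitly empty",
    "unal": "unallowed, suppressed intentionally",
}

# Optional "[flags]", then optional "(:ccode)", then everything else.
_MARKERS = re.compile(
    r"\s*(?:\[(?P<flags>[^\]]*)\]\s*)?"
    r"(?:\(:(?P<ccode>[^)]*)\)\s*)?"
    r"(?P<rest>.*)",
    re.DOTALL,
)


def split_markers(value):
    """Return (flags, ccode, rest); flags and ccode are None when absent."""
    match = _MARKERS.match(value)
    return match.group("flags"), match.group("ccode"), match.group("rest")


def is_missing(value):
    """True if the value carries one of the missing-value codes."""
    return split_markers(value)[1] in MISSING_CODES
```

   For example, split_markers("(:unkn) Anonymous") yields no flags, the
   code "unkn", and the text "Anonymous", while a machine-generated
   value such as "[] [sic]" yields empty flags and the data "[sic]".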

1464	   Once past an optional controlled code, the remaining string value is
1465	   subjected to one final test.  If the next non-whitespace character
1466	   is a `,' (comma), it indicates that the string value is
1467	   "sort-friendly".  This means that the value is (a) laid out with an
1468	   inverted word order useful for sorting items having comparably laid
1469	   out element values (items might be the containing ERC records) and
1470	   (b) that the value may contain other commas that indicate inversion
1471	   points should it become necessary to recover the value in natural
1472	   word order.  Typically, this feature is used to express Western-style
1473	   personal names in family-name-given-name order.  It can also be used
1474	   wherever natural word order might make sorting tricky, such as when
1475	   data contains titles or corporate names.  Here are some example
1476	   elements.

1478	         who:   ,  van Gogh, Vincent
1479	         who:,Howell, III, PhD, 1922-1987, Thurston
1480	         who:, Acme Rocket Factory, Inc., The
1481	         who:, Mao Tse Tung
1482	         who:, McCartney, Paul, Sir,
1483	         what:, Health and Human Services, United States Government
1484	                 Department of, The,

1486	   There are rules to use in recovering a copy of the value in natural
1487	   word order, if desired.  The above example strings have the following
1488	   natural word order values, respectively.

1490	         Vincent van Gogh
1491	         Thurston Howell, III, PhD, 1922-1987
1492	         The Acme Rocket Factory, Inc.
1493	         Mao Tse Tung
1494	         Sir Paul McCartney
1495	         The United States Government Department of Health and Human Services
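
   The recovery rules themselves are not given in this document.  The
   Python sketch below implements one reading that is consistent with
   the six examples above and should be taken as an assumption, not as
   the defined rules:  when the value ends with a comma, every comma is
   treated as an inversion point; otherwise only the last comma is.
   The name natural_order is hypothetical, and the input is the element
   value after any folding has been undone.

```python
def natural_order(value):
    """Recover natural word order from a sort-friendly value (inferred rule)."""
    value = value.strip()
    if not value.startswith(","):
        return value  # not marked sort-friendly; leave as is
    body = value[1:].strip()
    if body.endswith(","):
        # Trailing comma: treat every comma as an inversion point.
        parts = [part.strip() for part in body[:-1].split(",")]
        return " ".join(reversed(parts))
    head, comma, tail = body.rpartition(",")
    if not comma:
        return body  # no inversion point at all
    # Otherwise only the last comma inverts; earlier commas are kept.
    return tail.strip() + " " + head.strip()


examples = [
    ",  van Gogh, Vincent",
    ",Howell, III, PhD, 1922-1987, Thurston",
    ", Acme Rocket Factory, Inc., The",
    ", Mao Tse Tung",
    ", McCartney, Paul, Sir,",
    ", Health and Human Services, United States Government Department of, The,",
]
recovered = [natural_order(example) for example in examples]
```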

1497	7.6.  ERC Element Encoding and Dates

1499	   Some characters that need to appear in ERC element values might
1500	   conflict with special characters used for structuring ERCs, so there
1501	   needs to be a way to include them as literal characters that are
1502	   protected from special interpretation.  This is accomplished through
1503	   an encoding mechanism that resembles the %-encoding familiar to [URI]
1504	   handlers.

1506	   The ERC encoding mechanism also uses `%', but instead of taking two
1507	   following hexadecimal digits, it takes one non-alphanumeric character
1508	   or two alphabetic characters that cannot be mistaken for hex digits.
1509	   It is designed not to be confused with normal web-style %-encoding.
1510	   In particular it can be decoded without risking unintended decoding
1511	   of normal %-encoded data (which would introduce errors).  Here are
1512	   the one-character (non-alphanumeric) ERC encoding extensions.

1514	         ERC       Purpose
1515	         ---     ------------------------------------------------
1516	         %!      decodes to the element separator `|'
1517	         %%      decodes to a percent sign `%'
1518	         %.      decodes to a comma `,'
1519	         %_      a non-character used as syntax shim
1520	         %{      a non-character that begins an expansion block
1521	         %}      a non-character that ends an expansion block

1523	   One particularly useful construct in ERC element values is the pair
1524	   of special encoding markers ("%{" and "%}") that indicates an
1525	   "expansion" block.  Whatever string of characters they enclose will
1526	   be treated as if none of the contained whitespace (SPACEs, TABs,
1527	   Newlines) were present.  This comes in handy for writing long,
1528	   multi-part URLs in a readable way.  For example, the value in

1530	         where: http://foo.bar.org/node%{
1531	                    ? db = foo
1532	                    & start = 1
1533	                    & end = 5
1534	                    & buf = 2
1535	                    & query = foo + bar + zaf
1536	                %}

1538	   is decoded into an equivalent element, but with a correct and intact
1539	   URL:

1541	     where:
1542	      http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
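
   A decoder for the one-character extensions and expansion blocks
   might look like the following Python sketch.  The name erc_decode is
   hypothetical, and decoding escapes found inside an expansion block
   is an assumption of this sketch.  Ordinary web-style sequences such
   as "%5F" are deliberately left untouched.

```python
import re

_SINGLE = {"!": "|", "%": "%", ".": ",", "_": ""}
# Either a whole expansion block or a one-character ERC escape.
_TOKEN = re.compile(r"%\{(.*?)%\}|%([!%._])", re.DOTALL)


def erc_decode(value):
    """Decode ERC escapes and collapse whitespace inside expansion blocks."""
    def replace(match):
        if match.group(2) is not None:
            return _SINGLE[match.group(2)]
        # Expansion block: drop all contained whitespace.
        inner = re.sub(r"\s+", "", match.group(1))
        return erc_decode(inner)
    return _TOKEN.sub(replace, value)


value = """http://foo.bar.org/node%{
           ? db = foo
           & start = 1
           & end = 5
           & buf = 2
           & query = foo + bar + zaf
       %}"""
url = erc_decode(value)
```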

1544	   In a parting word about ERC element values, a commonly recurring
1545	   value type is a date, possibly followed by a time.  ERC dates take on
1546	   one of the following forms:

1548	         1999                (four digit year)
1549	         2000 12 29          (year, month, day)
1550	         2000 12 29 235955   (year, month, day, hour, minute, second)

1552	         21 Spring      31 1st quarter      25 Spring (so. hemisphere)
1553	         22 Summer      32 2nd quarter      26 Summer (so. hemisphere)
1554	         23 Fall        33 3rd quarter      27 Fall (so. hemisphere)
1555	         24 Winter      34 4th quarter      28 Winter (so. hemisphere)

1556	   In dates, all internal whitespace is squeezed out to achieve a
1557	   normalized form suitable for lexical comparison and sorting.  This
1558	   means that the following dates

1559	         2000 12 29 235955           (recommended for readability)
1560	         2000 12 29 23 59 55
1561	         20001229 23 59 55
1562	         20001229235955              (normalized date and time)

1564	   are all equivalent.  The first form is recommended for readability.
1565	   The last form (shortest and easiest to compute with) is the normal�
1566	   ized form.  Hyphens and commas are reserved to create date ranges and
1567	   lists, for example,

1569	         1996-2000                   (a range of four years)
1570	         1952, 1957, 1969            (a list of three years)
1571	         1952, 1958-1967, 1985       (a mixed list of dates and ranges)
1572	         20001229-20001231           (a range of three days)
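
   Date normalization and the range and list conventions are simple
   enough to sketch in Python.  The names normalize_date and
   parse_date_list are hypothetical, and no validation of the digits is
   attempted.

```python
def normalize_date(value):
    """Squeeze out all whitespace to obtain the normalized form."""
    return "".join(value.split())


def parse_date_list(value):
    """Split a value into dates and (start, end) ranges, each normalized."""
    items = []
    for item in value.split(","):
        start, hyphen, end = normalize_date(item).partition("-")
        items.append((start, end) if hyphen else start)
    return items


forms = ["2000 12 29 235955", "2000 12 29 23 59 55",
         "20001229 23 59 55", "20001229235955"]
normalized = {normalize_date(form) for form in forms}
mixed = parse_date_list("1952, 1958-1967, 1985")
```

   All four forms collapse to the single normalized string
   "20001229235955", and "mixed" holds two single years around one
   (start, end) range.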

1574	7.7.  ERC Stub Records and Internal Support

1576	   The ERC design introduces the concept of a "stub" record, which is an
1577	   incomplete ERC record intended to be supplemented with additional
1578	   elements before being released as a standalone ERC record.  A stub
1579	   ERC record has no minimum required elements.  It is just a group of
1580	   elements that does not begin with "erc:" but otherwise conforms to
1581	   the ERC record syntax.

1583	   ERC stubs may be useful in supporting internal procedures using the
1584	   ERC syntax.  Often they rely on the convenience and accuracy of
1585	   automatically supplied elements, even the basic ones.  To be ready
1586	   for external use, however, an ERC stub must be transformed into a
1587	   complete ERC record having the usual required elements.  An ERC stub
1588	   record can be convenient for metadata embedded in a document, where
1589	   elements such as location, modification date, and size -- which one
1590	   would not omit from an externalized record -- are omitted simply
1591	   because they are much better supplied by a computation.  A separate
1592	   local administrative procedure, not defined for ERCs in general,
1593	   would effect the promotion of stubs into complete records.
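
   As an illustration only, a local promotion procedure might look
   like the following Python sketch.  Everything in it is an
   assumption made for the example:  the function names, the
   restriction to one-line "name: value" elements, and the choice of
   who, what, when, and where as the required elements.  Promotion is
   a local matter that this document does not define.

```python
# Assumed for illustration: the elements a complete record must carry.
REQUIRED = ("who", "what", "when", "where")

def element_names(lines):
    """Names found on unindented 'name: value' lines."""
    return {ln.split(":", 1)[0].strip()
            for ln in lines if ":" in ln and not ln[0].isspace()}

def is_stub(record: str) -> bool:
    """True if the first non-blank line does not begin with "erc:"."""
    lines = [ln for ln in record.splitlines() if ln.strip()]
    return not lines or not lines[0].strip().startswith("erc:")

def promote_stub(stub: str, computed: dict) -> str:
    """Prefix the "erc:" label and append computed elements the stub lacks."""
    body = [ln for ln in stub.splitlines() if ln.strip()]
    present = element_names(body)
    extra = [f"{k}: {v}" for k, v in computed.items() if k not in present]
    missing = [k for k in REQUIRED if k not in (present | set(computed))]
    if missing:
        raise ValueError("still missing: " + ", ".join(missing))
    return "\n".join(["erc:"] + body + extra)

# A stub embedded in a document omits the location, which is computed.
stub = "who: Example Author\nwhat: Example Title\nwhen: 2003 07 31"
record = promote_stub(stub, {"where": "http://example.org/doc/1"})
print(record)
```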

1595	   While the ERC is a general-purpose container for exchange of resource
1596	   descriptions, it does not dictate how records must be internally
1597	   stored, laid out, or assembled by data providers or recipients.
1598	   Arbitrary internal descriptive frameworks can support ERCs simply by
1599	   mapping (e.g., on demand) local records to the ERC container format
1600	   and making them available for export.  Therefore, to support ERCs
1601	   there is no need for a data provider to convert internal data to be
1602	   stored in an ERC format.  On the other hand, any provider (such as
1603	   one just getting started in the business of resource description) may
1604	   choose to store and manipulate local data natively in the ERC format.
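
   A provider that keeps its own record layout could produce ERCs on
   demand with a simple field mapping, as in this Python sketch.  The
   local field names and the mapping table are hypothetical, and the
   output uses the one-element-per-line form with an "erc:" label.

```python
# Hypothetical mapping from local field names to ERC element names.
FIELD_MAP = {"creator": "who", "title": "what", "date": "when", "url": "where"}

def to_erc(local_record: dict) -> str:
    """Render a local record as ERC text without changing how it is stored."""
    lines = ["erc:"]
    for local_name, erc_name in FIELD_MAP.items():
        if local_name in local_record:
            lines.append(f"{erc_name}: {local_record[local_name]}")
    return "\n".join(lines)

print(to_erc({"title": "Example Title", "creator": "Example Author",
              "date": "2003", "url": "http://example.org/doc/1"}))
```

   The internal record is left untouched; only the exported view takes
   the ERC form.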

1606	8.  Advice to Web Clients

1608	   This section offers some advice to web client software developers.
1609	   It is necessarily speculative, as it tries to anticipate a series of
1610	   events that might lead to native web browser support for ARKs.

1612	   ARKs are envisaged to appear wherever durable object references are
1613	   planned.  Library cataloging records, literature citations, and
1614	   bibliographies are important examples.  In many of these places URLs
1615	   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
1616	   PURLs have been proposed as alternatives.

1618	   The strings representing ARKs are also envisaged to appear in some of
1619	   the places where URLs currently appear:  in hypertext links (where
1620	   they are not normally shown to users) and in rendered text (displayed
1621	   or printed).  Internet search engines, for example, tend to include
1622	   both actionable and manifest links when listing each item found.  A
1623	   normal HTML link for which the URL is not displayed looks like this.

1625	          <a href = "http://foo.bar.org/index.htm"> Click Here </a>

1627	   The same link with an ARK instead of a URL:

1629	          <a href = "ark:/14697/b12345x"> Click Here </a>

1631	   Web browsers would in general require a small modification to
1632	   recognize and convert this ARK, via mapping authority discovery, to
1633	   the URL form.

1635	          <a href = "http://a.b.org/ark:/14697/b12345x"> Click Here </a>

1637	   A browser that knows how to make that conversion could also
1638	   automatically detect and replace a non-working NMAH.
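
   The conversion might be sketched in Python as follows.  This is not
   a description of any existing browser.  The function names are
   hypothetical, the NMAH is assumed to have been found already by
   mapping authority discovery, and the is_alive probe is supplied by
   the caller.

```python
import re

# An optional "http://NMAH/" prefix followed by the ARK proper.
ARK_RE = re.compile(r"^(?:https?://([^/]+)/)?(ark:/\d+/.+)$")

def split_ark(ref: str):
    """Return (nmah_or_None, bare_ark) for a bare or URL-form ARK."""
    m = ARK_RE.match(ref.strip())
    if m is None:
        raise ValueError(f"not an ARK: {ref!r}")
    return m.group(1), m.group(2)

def ark_to_url(ref: str, nmah: str) -> str:
    """Build the URL form, replacing any NMAH already present."""
    _, ark = split_ark(ref)
    return f"http://{nmah}/{ark}"

def first_working_url(ref: str, candidates, is_alive):
    """Try the ARK's own NMAH, then each candidate; None if none respond."""
    own, ark = split_ark(ref)
    for host in ([own] if own else []) + list(candidates):
        if is_alive(host):
            return f"http://{host}/{ark}"
    return None

print(ark_to_url("ark:/14697/b12345x", "a.b.org"))
```

   Given a link whose NMAH no longer answers, first_working_url falls
   back to the other hostports known for the same NAA.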

1640	   An NAA will typically make known the associations it creates by
1641	   publishing them in catalogs, actively advertising them, or simply
1642	   leaving them on web sites for visitors (e.g., users, indexing
1643	   spiders) to stumble across in browsing.

1645	9.  Security Considerations

1647	   The ARK naming scheme poses no direct risk to computers and networks.
1648	   Implementors of ARK services need to be aware of security issues when
1649	   querying networks and filesystems for Name Mapping Authority
1650	   services, and the concomitant risks from spoofing and obtaining
1651	   incorrect information.  These risks are no greater for ARK mapping
1652	   authority discovery than for other kinds of service discovery.  For
1653	   example, recipients of ARKs with a specified hostport (NMAH) should
1654	   treat it like a URL and be aware that the identified ARK service may
1655	   no longer be operational.

1657	   Apart from mapping authority discovery, ARK clients and servers
1658	   subject themselves to all the risks that accompany normal operation
1659	   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
1660	   As a specialization of such protocols, an ARK service may limit
1661	   exposure to the usual risks.  Indeed, ARK services may enhance a kind
1662	   of security by helping users identify long-term reliable references
1663	   to information objects.

1665	10.  Authors' Addresses

1667	   John A. Kunze
1668	   California Digital Library
1669	   University of California, Office of the President
1670	   415 20th St, 4th Floor
1671	   Oakland, CA  94612-3550, USA

1673	   Fax:   +1 510-893-5212
1674	   EMail: jak@ucop.edu

1675	   R. P. C. Rodgers
1676	   US National Library of Medicine
1677	   8600 Rockville Pike, Bldg. 38A
1678	   Bethesda, MD  20894, USA

1680	   Fax:   +1 301-496-0673
1681	   EMail: rodgers@nlm.nih.gov

1683	11.  References

1685	   [DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
1686	              Element Set, Version 1.1:  Reference Description", July
1687	              1999, http://dublincore.org/documents/dces/.

1689	   [DNS]      P.V. Mockapetris, "Domain Names - Concepts and
1690	              Facilities", RFC 1034, November 1987.

1692	   [DOI]      International DOI Foundation, "The Digital Object
1693	              Identifier (DOI) System", February 2001,
1694	              http://dx.doi.org/10.1000/203.

1696	   [EMHDRS]   D. Crocker, "Standard for the format of ARPA Internet text
1697	              messages", RFC 822, August 1982.

1699	   [ERC]      J. Kunze, "Electronic Resource Citations", work in
1700	              progress.

1702	   [HKMP]     J. Kunze, "HTTP Key Mapping Protocol", work in progress.

1704	   [HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
1705	              HTTP/1.1", RFC 2616, June 1999.

1707	   [MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
1708	              April 1992.

1710	   [NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
1711	              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

1713	   [NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
1714	              Electronic Information", ARL 212:8-9, October 2000,
1715	              http://www.arl.org/newsltr/212/nlm.html

1717	   [PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
1718	              Resource Locators", 1996,
1719	              http://purl.oclc.org/OCLC/PURL/INET96

1721	   [REG]      J. Kunze, "Resource Metadata Vocabulary", work in
1722	              progress.

1724	   [URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
1725	              (URI): Generic Syntax", RFC 2396, August 1998.

1727	   [URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
1728	              as Uniform Resource Names", RFC 2288, February 1998.

1730	   [URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

1732	   [URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
1733	              RFC 2611, June 1999.

1735	   [TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
1736	              RFC 854, May 1983.

1738	12.  Appendix:  An NLM Prototype ARK Service

1740	   The US National Library of Medicine (NLM) has an experimental,
1741	   prototype ARK service under development.  It is being made available
1742	   for purposes of demonstrating various aspects of the ARK system, but
1743	   is subject to temporary or permanent withdrawal (without notice)
1744	   depending upon the circumstances of the small research group
1745	   responsible for making it available.  It is described at:

1747	         http://ark.nlm.nih.gov/

1749	   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

1751	13.  Appendix:  Current ARK Name Authority Table

1753	   This appendix contains a copy of the Name Authority Table (a file) at
1754	   the time of writing.  It may be loaded into a local filesystem (e.g.,
1755	   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
1756	   NMAHs (Name Mapping Authority Hostports).  It contains Perl code that
1757	   can be copied into a standalone script that processes the table (as a
1758	   file).  Because this is still a proposed file, none of the values in
1759	   it are real.

1761	     #
1762	     # Name Assigning Authority / Name Mapping Authority Lookup Table
1763	     #       Last change:   31 July 2003
1764	     #       Reload from:   http://ark.nlm.nih.gov/etc/natab
1765	     #       Mirrored at:   http://ark.cdlib.org/natab
1766	     #       To register:   mailto:jak@ucop.edu?Subject=naareg
1767	     #       Process with:  Perl script at end of this file (optional)
1768	     #
1769	     # Each NAA appears at the beginning of a line with the NAA Number
1770	     # first, a colon, and an ARK or URL to a statement of naming policy
1771	     # (see http://ark.cdlib.org for an example).
1772	     # All the NMA hostports that service an NAA are listed, one per
1773	     # line, indented, after the corresponding NAA line.
1774	     #
1775	     #       National Library of Medicine
1776	     12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
1777	             ark.nlm.nih.gov USNLM
1778	             foobar.zaf.org UCSF
1779	             sneezy.dopey.com BIREME
1780	     #
1781	     #       Library of Congress
1782	     12026:  http://www.loc.gov/xxx/naapolicy.html
1783	             foobar.zaf.org USLC
1784	             sneezy.dopey.com USLC
1785	     #
1786	     #       National Agricultural Library
1787	     12027:  http://www.nal.gov/xxx/naapolicy.html
1788	             foobar.zaf.gov:80 USNAL
1789	     #
1790	     #       University of California
1791	     13030:  http://ark.cdlib.org/
1792	             ark.cdlib.org CDL
1793	     #
1794	     #       World Intellectual Property Organization
1795	     13038:  http://www.wipo.int/xxx/naapolicy.html
1796	             www.wipo.int WIPO
1797	     #
1798	     #--- end of data ---
1799	     # The following Perl script takes an NAA as argument and outputs
1800	     # the NMAs in this file listed under any matching NAA.
1801	     #
1802	     # my $naa = shift;
1803	     # while (<>) {
1804	     #       next if (! /^$naa:/);
1805	     #       while (<>) {
1806	     #               last if (! /^[#\s]./);
1807	     #               print "$1\n" if (/^\s+(\S+)/);
1808	     #       }
1809	     # }
1810	     #
1811	     # Create a g/t/nroff-safe version of this table with the UNIX command,
1812	     #
1813	     #       expand natab | sed 's/\\/\\\e/g' > natab.roff
1814	     #
1815	     # end of file
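
   For readers who do not use Perl, a comparable lookup might be
   written in Python as sketched below.  It is not part of the table
   file.  The function name is hypothetical, and the sketch assumes
   the file as it would be loaded locally, with NAA lines starting in
   the first column and NMAH lines indented beneath them.

```python
def lookup_nmahs(table_text: str, naa: str) -> list:
    """Return the NMA hostports listed under the given NAA number."""
    hosts = []
    in_entry = False
    for line in table_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue                    # comment or blank line
        if not line[0].isspace():       # an NAA line starts in column one
            in_entry = line.startswith(naa + ":")
        elif in_entry:                  # indented NMAH line under our NAA
            hosts.append(line.split()[0])
    return hosts

# A two-entry excerpt of the table above (values are not real).
SAMPLE = """\
12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov USNLM
        foobar.zaf.org UCSF
#
13030:  http://ark.cdlib.org/
        ark.cdlib.org CDL
"""

print(lookup_nmahs(SAMPLE, "12025"))
```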

1817	14.  Copyright Notice

1819	   Copyright (C) The Internet Society (2003).  All Rights Reserved.

1821	   This document and translations of it may be copied and furnished to
1822	   others, and derivative works that comment on or otherwise explain it
1823	   or assist in its implementation may be prepared, copied, published
1824	   and distributed, in whole or in part, without restriction of any
1825	   kind, provided that the above copyright notice and this paragraph are
1826	   included on all such copies and derivative works.  However, this
1827	   document itself may not be modified in any way, such as by removing
1828	   the copyright notice or references to the Internet Society or other
1829	   Internet organizations, except as needed for the  purpose of
1830	   developing Internet standards in which case the procedures for
1831	   copyrights defined in the Internet Standards process must be
1832	   followed, or as required to translate it into languages other than
1833	   English.

1835	   The limited permissions granted above are perpetual and will not be
1836	   revoked by the Internet Society or its successors or assigns.

1838	   This document and the information contained herein is provided on an
1839	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1840	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1841	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1842	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1843	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1845	   The IETF invites any interested party to bring to its attention any
1846	   copyrights, patents or patent applications, or other proprietary
1847	   rights which may cover technology that may be required to practice
1848	   this standard.  Please address the information to the IETF Executive
1849	   Director.

1851	Expires 31 January 2004
1852	                           Table of Contents

1854	Status of this Document
1855	Abstract
1856	1.  Introduction
1857	1.1.  Three Reasons to Use ARKs
1858	1.2.  Organizing Support for ARKs
1859	1.3.  A Definition of Identifier
1860	2.  ARK Anatomy
1861	2.1.  The Name Mapping Authority Hostport (NMAH)
1862	2.2.  The Name Assigning Authority Number (NAAN)
1863	2.3.  The Name Part
1864	2.3.1.  Names that Reveal Object Hierarchy
1865	2.3.2.  Names that Reveal Object Variants
1866	2.3.3.  Hyphens are Ignored
1867	2.4.  Normalization and Lexical Equivalence
1868	2.5.  Naming Considerations
1869	3.  Assigners of ARKs
1870	4.  Finding a Name Mapping Authority
1871	4.1.  Looking Up NMAHs in a Globally Accessible File
1872	4.2.  Looking up NMAHs Distributed via DNS
1873	5.  Generic ARK Service Definition
1874	5.1.  Generic ARK Access Service (access, location)
1875	5.2.  Generic Policy Service (permanence, naming, etc.)
1876	5.3.  Generic Description Service
1877	6.  Overview of the HTTP Key Mapping Protocol (HKMP)
1878	7.  Overview of Electronic Resource Citations (ERCs)
1879	7.1.  ERC Syntax
1880	7.2.  ERC Stories
1881	7.3.  The ERC Anchoring Story
1882	7.4.  ERC Elements
1883	7.5.  ERC Element Values
1884	7.6.  ERC Element Encoding and Dates
1885	7.7.  ERC Stub Records and Internal Support
1886	8.  Advice to Web Clients
1887	9.  Security Considerations
1888	10.  Authors' Addresses
1889	11.  References
1890	12.  Appendix:  An NLM Prototype ARK Service
1891	13.  Appendix:  Current ARK Name Authority Table
1892	14.  Copyright Notice