idnits 2.17.1 

draft-iab-char-rep-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 16, 2004) is 7373 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFCYYYY' on line 372

  -- Looks like a reference, but probably isn't: 'ISO10646' on line 373

  ** Obsolete normative reference: RFC 2396 (ref. '1') (Obsoleted by RFC 3986)

  == Outdated reference: A later version (-11) exists of draft-duerst-iri-05


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                          L. Daigle
3	Internet-Draft                                                 T. Hardie
4	Expires: August 16, 2004                                          Editor
5	                                             Internet Architecture Board
6	                                                                     IAB
7	                                                       February 16, 2004

9	    Considerations on Increasing Character Repertoires for Protocol
10	                          Actionable Elements
11	                         draft-iab-char-rep-01

13	Status of this Memo

15	   This document is an Internet-Draft and is in full conformance with
16	   all provisions of Section 10 of RFC2026.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on August 16, 2004.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2004).  All Rights Reserved.

40	Abstract

42	   This document describes a set of considerations and strategies to use
43	   in increasing the character repertoire available in a protocol
44	   actionable element or suite of protocol actionable elements.  This
45	   document is not meant to provide normative instruction to protocol
46	   designers, but does hope to provide guidance on common issues arising
47	   from this task.  Feedback should be sent to the editors or the IAB.

49	Table of Contents

51	   1.  Definitions
52	   2.  Introduction
53	   3.  Avoidance mechanisms
54	   3.1 Choosing a large initial character repertoire
55	   3.2 Choosing opaque protocol tokens
56	   3.3 Expansion mechanisms
57	   3.4 Replace
58	   3.5 Subsume
59	   3.6 Map
60	   4.  Layering a presentation element on a new protocol element
61	   5.  Selecting a strategy
62	   6.  Case Studies
63	   6.1 Uniform and Internationalized Resource Identifiers
64	   7.  Security Considerations
65	   8.  IANA Considerations
66	   9.  Acknowledgements
67	       References
68	       Authors' Addresses
69	       Full Copyright Statement

71	1. Definitions

73	   Protocol (actionable) element: A protocol actionable element, or
74	   protocol element, is any portion of a message that affects processing
75	   of that message by the protocol in question.  In general, protocol
76	   elements are bound to specific processing choices by membership in a
77	   set of predetermined tokens or by explicit structure.  Protocol
78	   elements are context dependent in that the processing for a token is
79	   specific to a protocol.  To IP, for example, a TCP port number is
80	   payload; to TCP it is a protocol element.  Similarly, to TCP a
81	   Content-encoding: header is payload; to HTTP, it is a protocol
82	   element.

84	   Character repertoire: A character repertoire is the set of all
85	   characters in all permitted encodings which may be used in a protocol
86	   element.  Each element in a character repertoire is a tuple of a code
87	   point and an encoding.  Thus the glyph "a" would appear three times
88	   in a character repertoire that permitted ASCII, iso-8859-1, and iso-
89	   8859-7.
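
   As an illustration only, the tuple view of a character repertoire
   can be sketched in Python.  The helper names below are hypothetical,
   and the sketch assumes single-octet code points, which holds for the
   three encodings named above but not for encodings in general.

```python
# Hypothetical sketch: a repertoire as a set of (code point, encoding)
# tuples, built from the three encodings named in the definition.
ENCODINGS = ("ascii", "iso-8859-1", "iso-8859-7")


def build_repertoire(encodings):
    """Collect every assigned single-octet code point per encoding."""
    repertoire = set()
    for encoding in encodings:
        for code_point in range(256):
            try:
                bytes([code_point]).decode(encoding)
            except UnicodeDecodeError:
                continue  # unassigned in this encoding
            repertoire.add((code_point, encoding))
    return repertoire


def entries_for(character, repertoire):
    """Return the tuples in the repertoire that represent a character."""
    return sorted(
        (code_point, encoding)
        for code_point, encoding in repertoire
        if bytes([code_point]).decode(encoding) == character
    )


REPERTOIRE = build_repertoire(ENCODINGS)
print(entries_for("a", REPERTOIRE))
# [(97, 'ascii'), (97, 'iso-8859-1'), (97, 'iso-8859-7')]
```

   The letter "a" yields three tuples, one per permitted encoding,
   which is the counting described in the definition.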

91	   Character set:  As it says, "a set of characters", but more
92	   particularly a set of characters as represented by code points in a
93	   particular encoding.

95	2. Introduction

97	   After a protocol's initial deployment, changes in the use of the
98	   protocol sometimes necessitate revisiting the character repertoire
99	   originally chosen for one or more of the elements which make up the
100	   protocol.  On rare occasions, this occurs because the protocol
101	   designers need to increase the number of tokens available in a fixed-
102	   length field and choose to do so by increasing the number of
103	   characters which may be used.  More commonly, the motive for the
104	   increase of a character repertoire is the exposure of a protocol
105	   element to a user community.  Once this leakage occurs, there is
106	   often pressure to expand the permitted character repertoire of the
107	   protocol element to match the character repertoire in use in that
108	   community.

110	   Though increasing a character repertoire may appear to be a
111	   relatively simple matter, there are a number of protocol processing
112	   functions which may be affected.  First among these is matching.
113	   Many encodings have very specific matching rules or equivalence
114	   tables; increasing a character repertoire to include a new encoding
115	   implies that the protocol must specify how matching works in that
116	   encoding.  Like matching, sorting works in different ways in
117	   different encoding schemes, and including a new encoding means
118	   specifying sorting algorithms for use with it.  Transformation
119	   presents some unique issues, as it may be possible for some systems
120	   to map only unidirectionally from one encoding to another.  Any of
121	   these, and more, can present problems to a protocol designer who must
122	   post-facto retrofit an increased character repertoire into a deployed
123	   protocol.
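
   A small Python sketch may make the matching and transformation
   concerns concrete.  It is not drawn from any particular protocol;
   the function names are hypothetical, and the two encodings are
   chosen only because they assign different characters to the same
   octet.

```python
# One octet as it might appear in a protocol element on the wire.
RAW = b"\xe1"

AS_LATIN1 = RAW.decode("iso-8859-1")  # LATIN SMALL LETTER A WITH ACUTE
AS_GREEK = RAW.decode("iso-8859-7")   # GREEK SMALL LETTER ALPHA


def octets_match(a: bytes, b: bytes) -> bool:
    """Matching as a parser unaware of encodings would perform it."""
    return a == b


def characters_match(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Matching after each value is decoded with its own encoding."""
    return a.decode(a_enc) == b.decode(b_enc)


def can_transform(text: str, target_encoding: str) -> bool:
    """Report whether text can be re-encoded in the target encoding."""
    try:
        text.encode(target_encoding)
    except UnicodeEncodeError:
        return False
    return True


print(octets_match(RAW, RAW))
print(characters_match(RAW, "iso-8859-1", RAW, "iso-8859-7"))
print(can_transform(AS_GREEK, "iso-8859-1"))
```

   The octets compare equal while the characters do not, and the Greek
   reading of the octet has no representation in iso-8859-1, so a
   transformation in that direction fails.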

125	3. Avoidance mechanisms

127	   To avoid the need to increase character repertoires at some later
128	   date, protocol designers can either start with a character repertoire
129	   which is large enough to encompass that in use in the target user
130	   community or use protocol elements that are sufficiently opaque to a
131	   human user that their leakage is unlikely to present later pressure.
132	   Both strategies, unfortunately, have been notoriously difficult to
133	   get right.

135	3.1 Choosing a large initial character repertoire

137	   In this avoidance strategy, the protocol designers presume that their
138	   protocol elements will leak in the future and provide a character
139	   repertoire which is sufficiently rich to match the user community's
140	   needs.  Increasing use of a protocol, however, often changes the
141	   target user community beyond the initial designers' projections.  A
142	   character repertoire which looks large to one user community may be
143	   completely wrong or very limited to another.  When protocol designers
144	   attempt to avoid the issue by using a character repertoire with a
145	   very large number of code points in a very large number of encodings,
146	   they incur real costs in parser complexity, processing overhead, and
147	   bloat.  They also risk that misconfiguration of these complex parsers
148	   will result in incorrect protocol processing.

150	3.2 Choosing opaque protocol tokens

152	   In the second case, designers who choose to use tokens or structure
153	   which are not human-readable can resist later pressure to increase
154	   the character repertoire available.  As those who have used encodings
155	   like ASN.1 can attest, there is, however, an increased development
156	   cost, as those working with the protocol must develop an
157	   understanding of the use of the tokens or structure without the aid
158	   of readability.  This avenue may also be blocked or narrowed to
159	   protocol designers who will need to pass the new elements among
160	   different protocols; in those cases, the new protocol is either
161	   constrained by the previous choices or must provide a normative
162	   mapping to them.

164	   When designers use tokens or structures which are not human readable,
165	   it is common to create a presentation format or layer which is mapped
166	   to the tokens or structures.  One of the advantages of this approach
167	   is that new mappings can be defined as new user communities express
168	   the need for them.  It is important, however, that these are always
169	   retained as mappings to the protocol elements, and are not treated as
170	   protocol elements themselves.
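
   The separation between opaque tokens and their presentation can be
   sketched as follows.  The token values, community labels, and
   function names are all hypothetical; the point is only that
   protocol processing consults the token and never the label.

```python
# Hypothetical opaque tokens: the integers are the protocol elements.
TOKEN_ACCEPT = 1
TOKEN_REJECT = 2

# Presentation mappings, keyed by a hypothetical community label.
# A new community is served by adding an entry here, with no change
# to the tokens or to protocol processing.
PRESENTATION = {
    "en": {TOKEN_ACCEPT: "accept", TOKEN_REJECT: "reject"},
    "fr": {TOKEN_ACCEPT: "accepter", TOKEN_REJECT: "refuser"},
}


def present(token: int, community: str) -> str:
    """Map a token to the label shown to a given community."""
    return PRESENTATION[community][token]


def parse_presentation(label: str, community: str) -> int:
    """Map a label entered by a user back to its token."""
    for token, text in PRESENTATION[community].items():
        if text == label:
            return token
    raise ValueError("unknown label: " + label)


def process(token: int) -> bool:
    """Protocol processing looks only at the token, never at a label."""
    if token == TOKEN_ACCEPT:
        return True
    if token == TOKEN_REJECT:
        return False
    raise ValueError("unknown token")
```

   Here process(parse_presentation("refuser", "fr")) and
   process(parse_presentation("reject", "en")) take the same path,
   because both labels map to the same token.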

172	3.3 Expansion mechanisms

174	   For designers who must increase the character repertoire for a
175	   particular protocol element, there are three basic strategies
176	   available: they may replace the existing protocol element with a new
177	   one; they may subsume the character repertoire of the existing
178	   protocol element in a new one; they may map the new character
179	   repertoire into the existing repertoire.

181	   For each of the following strategies, consider the following example:
182	   a protocol element called "POSTAL" used to name the U.S. ZIP code in
183	   which the network element is placed cannot handle postal codes
184	   containing characters outside (0,1,2,3,4,5,6,7,8,9) encoded in a
185	   subset of US-ASCII.  We will refer to this character set as (NUM-
186	   ASCII).  The original character repertoire for this protocol element
187	   has NUM-ASCII as its single member character set.

189	3.4 Replace

191	   Replacing an existing protocol element with an entirely new protocol
192	   element with a different character repertoire is by far the cleanest
193	   solution from a design perspective.  A new protocol element may have
194	   its own matching and sorting rules, without regard to any previous
195	   deployment.  This means that the new element will have as little
196	   baggage as is possible when updating parsers and setting forth how it
197	   fits into the protocol's semantics.

199	   Unfortunately, this method presents a raft of deployment problems.
200	   Since existing protocol implementations will know nothing about it,
201	   they cannot be interoperable with any entirely new protocol element.
202	   At best, they can ignore it gracefully; at worst, they will fail.  A
203	   protocol designer can react to this by changing the revision number
204	   on a protocol, by using some form of feature negotiation, or by using
205	   heuristics (including failure!) to determine whether or not a new
206	   protocol element may be used.  All of these are difficult to get
207	   right, especially in hop-by-hop protocols, in which it may not be
208	   possible to determine whether all hops support specific features or
209	   versions.

211	   A protocol designer tackling this problem for the protocol element
212	   naming the postal code in which a network element is placed might
213	   replace "POSTAL" with "NEW_POSTAL" and create a new character
214	   repertoire for "NEW_POSTAL" which contained the single entry (ISO-
215	   8859-1).  [This is merely an example; the choice of which character
216	   set or sets to use would be made in this instance by reference to the
217	   relevant international postal standards.] Obviously, any system which
218	   did not understand "NEW_POSTAL" would need to be upgraded to handle
219	   the new character set.  Depending on the transition mechanism,
220	   systems communicating postal codes which were numeric-only might well
221	   include both "POSTAL" and "NEW_POSTAL" protocol elements.
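
   One possible transition, in which numeric-only codes are sent in
   both elements, might look like the following sketch.  The message
   is modelled as a Python dictionary of element names to octets; that
   model and the function names are assumptions made for illustration,
   not part of any defined protocol.

```python
from typing import Dict, Optional


def build_elements(postal_code: str) -> Dict[str, bytes]:
    """Emit NEW_POSTAL always, and POSTAL when the code is numeric."""
    # NEW_POSTAL uses the example's ISO-8859-1 repertoire.
    elements = {"NEW_POSTAL": postal_code.encode("iso-8859-1")}
    if postal_code.isascii() and postal_code.isdigit():
        # Numeric-only codes remain visible to legacy systems.
        elements["POSTAL"] = postal_code.encode("ascii")
    return elements


def legacy_receive(elements: Dict[str, bytes]) -> Optional[str]:
    """A legacy system ignores NEW_POSTAL and reads POSTAL only."""
    raw = elements.get("POSTAL")
    return None if raw is None else raw.decode("ascii")


def upgraded_receive(elements: Dict[str, bytes]) -> Optional[str]:
    """An upgraded system prefers NEW_POSTAL, then falls back."""
    raw = elements.get("NEW_POSTAL")
    if raw is not None:
        return raw.decode("iso-8859-1")
    return legacy_receive(elements)
```

   In this sketch a legacy receiver obtains nothing at all for a code
   such as "K1A 0B1", which is the graceful-ignore outcome described
   above rather than a failure.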

223	3.5 Subsume

225	   Rather than completely replacing an existing protocol element,
226	   another strategy is to create a protocol element which subsumes the
227	   character repertoire of the existing protocol element.  When this
228	   option is chosen, the new protocol element retains all the character
229	   sets and the related matching and sorting rules which were originally
230	   present.  These become a strict subset of the new character
231	   repertoire.

233	   This strategy limits the functionality of the new protocol element
234	   both by forcing it to include specific character sets and by
235	   requiring that the semantics of the new protocol element exactly
236	   match the existing protocol element.  This strategy also retains many
237	   of the deployment problems of the replacement strategy, though it
238	   offers some opportunities to mitigate the issues.  Like the
239	   replacement strategy, there may need to be negotiation mechanisms
240	   capable of handling both protocol elements, though new
241	   implementations can sometimes treat the old protocol element as a
242	   degenerate case of the new protocol element.

244	   If our "POSTAL" protocol design team took this strategy, they might
245	   replace the (NUM-ASCII) character repertoire of "POSTAL" with a new
246	   protocol element "BIG_POSTAL" for which the character repertoire is
247	   (NUM-ASCII, US-ASCII).  Because NUM-ASCII is a strict subset of US-
248	   ASCII, the protocol can treat all "POSTAL" protocol elements as if
249	   they were "BIG_POSTAL" protocol elements.  Note that this is the
250	   simplest possible example of this particular strategy, as there is no
251	   need to mark which character set from the character repertoire is in
252	   use.  More complex examples may require much more complex processing
253	   to achieve the same results.
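
   The subset relationship in this example can be checked directly.
   The sketch below assumes, for illustration, that elements arrive as
   a name and a string of octets; the parser names are hypothetical.

```python
# Code points of the two character sets in the example.
NUM_ASCII = frozenset(b"0123456789")
US_ASCII = frozenset(range(0x80))

# NUM-ASCII is a strict subset of US-ASCII.
assert NUM_ASCII < US_ASCII


def parse_postal(raw: bytes) -> str:
    """Original parser: accepts NUM-ASCII only."""
    if not raw or not set(raw) <= NUM_ASCII:
        raise ValueError("POSTAL permits digits only")
    return raw.decode("ascii")


def parse_big_postal(raw: bytes) -> str:
    """New parser: accepts any US-ASCII value."""
    if not raw or not set(raw) <= US_ASCII:
        raise ValueError("BIG_POSTAL permits US-ASCII only")
    return raw.decode("ascii")


def parse_element(name: str, raw: bytes) -> str:
    """A new implementation treats POSTAL as a case of BIG_POSTAL."""
    if name in ("POSTAL", "BIG_POSTAL"):
        return parse_big_postal(raw)
    raise ValueError("unknown element: " + name)
```

   Every value the original parser accepts is also accepted by the new
   one, so no marker is needed to say which character set is in use.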

255	3.6 Map

257	   In some instances it may be possible and desirable to map an expanded
258	   character repertoire onto the existing code points specified by a
259	   protocol.  In this case, the code points are themselves retained but
260	   the character encoding portion of the tuple is changed to create an
261	   expanded character repertoire.  This strategy can only work when some
262	   marker is used to indicate which character encoding applies to a
263	   specific instance of the protocol.  This marker must be something
264	   which is non-operative in the original protocol processing, or the
265	   strategy will incur the negotiation costs mentioned above.  This
266	   strategy will tend to increase the size of protocol elements unless
267	   the original code points were radically under-used.  It also carries
268	   the near-certainty that there will be occasions in which protocol
269	   elements encoded with the new character encoding are mis-identified
270	   as being encoded with the original character encoding.

272	   This strategy has somewhat unique deployment consequences, in that it
273	   is both easier to get initial deployment and harder to get complete
274	   penetration.  Because the same code points are used throughout, there
275	   is no requirement that all systems upgrade for the increased
276	   character repertoire to be available to a subset of users.  There is
277	   also, however, almost no incentive for upgrade of systems which do
278	   not themselves require the increased repertoire.  This is
279	   particularly true in hop-by-hop and commonly proxied protocols,
280	   because the on-path intermediate systems will pass the elements of
281	   the expanded repertoire by virtue of their being legitimate code
282	   points in the original repertoire; they do not need to upgrade and
283	   they probably never will.

285	   For our protocol design team to tackle "POSTAL" using this strategy
286	   they must develop or discover an encoding which allows them to
287	   represent all the needed characters using just (NUM-ASCII).  If, for
288	   example, the character repertoire needed to add a character set which
289	   included (A-Z), but no others, the team could use US-ASCII's three
290	   digit decimal encoding for each included character.  A postal code
291	   like "KLHSW1" would then be encoded as "075076072083087049".
292	   Provided that the original POSTAL protocol element had a field length
293	   sufficient to handle the new encoding, it could carry the new values
294	   without any difficulty.  The difficulty would be determining whether
295	   the new encoding or the old should be assumed; in this limited case,
296	   length alone could be made a marker by padding any short alphabetic
297	   postal codes with the ASCII null character, "000", until they reached
298	   a length sufficient to trigger treatment as non-ZIP code postal
299	   codes.  In other cases more complex triggers would be required.
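
   A minimal sketch of this mapping follows.  It assumes that legacy
   values are at most nine digits long (a ZIP or ZIP+4 code without
   punctuation) and that mapped values are therefore padded to at
   least twelve digits; both thresholds and the function names are
   assumptions for illustration.  Note that "H" is decimal 072 in
   US-ASCII.

```python
# Assumption: legacy values are 5-digit ZIP or 9-digit ZIP+4 codes.
LEGACY_MAX_DIGITS = 9
# Assumption: mapped values are padded to at least four groups.
MIN_MAPPED_DIGITS = 12
PAD = "000"  # three-digit decimal form of the ASCII null character


def map_to_num_ascii(code: str) -> str:
    """Encode each US-ASCII character as three decimal digits."""
    if not code.isascii():
        raise ValueError("only US-ASCII characters can be mapped")
    digits = "".join("{:03d}".format(ord(ch)) for ch in code)
    while len(digits) < MIN_MAPPED_DIGITS:
        digits += PAD
    return digits


def read_postal(value: str) -> str:
    """Return the postal code carried by a NUM-ASCII POSTAL value."""
    if not (value.isascii() and value.isdigit()):
        raise ValueError("POSTAL carries NUM-ASCII digits only")
    if len(value) <= LEGACY_MAX_DIGITS:
        return value  # short values keep their original meaning
    if len(value) % 3 != 0:
        raise ValueError("mapped values are whole three-digit groups")
    numbers = [int(value[i:i + 3]) for i in range(0, len(value), 3)]
    if any(number > 127 for number in numbers):
        raise ValueError("group is not a US-ASCII code point")
    # Strip the null padding added by map_to_num_ascii.
    return "".join(chr(number) for number in numbers).rstrip("\x00")


print(map_to_num_ascii("KLHSW1"))  # 075076072083087049
print(read_postal("94301"))        # 94301
```

   A system that has not been upgraded would still accept the mapped
   value as a run of digits, which is both the appeal of the strategy
   and the source of the mis-identification risk noted above.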

301	4. Layering a presentation element on a new protocol element

303	   It is noted above that designers using non-human readable tokens may
304	   provide a mapping to a presentation element which can be used by
305	   humans working with the protocol.  In employing any of the strategies
306	   above, it is useful for protocol designers to consider introducing a
307	   presentation element at the same time.  This is almost a required
308	   part of the mapping strategy, as using an encoding based on the
309	   original set of code points does not help the user community unless
310	   it can also be mapped to an encoding in common use for presentation.
311	   It may be used with any of them, though, and given the potential for
312	   the introduction of new character encodings, it must be considered
313	   carefully as a method of ensuring that the same problem does not face
314	   the protocol in a few years' time.

316	5. Selecting a strategy

318	   The first step in selecting a strategy is identifying the protocol
319	   processing choices which depend on the protocol element.  If a
320	   protocol element is passed among different protocols, this set of
321	   choices must be identified for each of the protocols which depend on
322	   the element.  After those have been identified, the available methods
323	   for passing the protocol elements from one protocol to another must
324	   be considered.

326	   If at all possible, a single strategy should be selected for use with
327	   a specific protocol element, even when that protocol element will be
328	   passed among different protocols.  Since protocol processing is
329	   context-specific, it is technically possible to use different methods
330	   in different contexts, but this increase in complexity rarely has a
331	   corresponding gain.

333	   Whether the protocol element will be used in one protocol or
334	   several, the core question to consider is how best to maintain
335	   interoperability while increasing the character repertoire.  For
336	   example, if creating a new protocol element as a fully fledged
337	   replacement, are there available mechanisms to handle the negotiation
338	   and/or versioning?  Alternatively, are there methods which would
339	   allow both protocol elements to coexist?

341	   The second question to consider is the cost of implementation.  If,
342	   for example, a choice is made to introduce a protocol element which
343	   subsumes the original character repertoire in a larger character
344	   repertoire, how expensive will the increase in parsing complexity be?

346	   The third question to consider is likely deployment patterns.  For a
347	   client/server protocol, will it be feasible to update both client and
348	   server?  For a hop-by-hop protocol, will there be any pressure for
349	   intermediate servers to upgrade?

351	   A related question is whether this change will be tied to other
352	   changes which will drive adoption, or whether this change will be
353	   unrelated to other updates to the protocol.

355	6. Case Studies

357	6.1 Uniform and Internationalized Resource Identifiers

359	   Uniform Resource Identifiers ([1]) make use of the 7-bit US-ASCII
360	   character repertoire.  The URI syntax permits other encodings to be
361	   mapped into that repertoire by defining a hex-encoding framework.

363	   Increasingly, new URI schemes are using UTF-8 for characters
364	   beyond US-ASCII.  In recognition of this, and to provide a means to
365	   handle such identifiers in a more straightforward manner, the
366	   "Internationalized Resource Identifier" (IRI) has been introduced.

368	   From [2]:

370	      "This document defines a new protocol element, the
371	      Internationalized Resource Identifier (IRI), as a complement to
372	      the URI [RFCYYYY].  An IRI is a sequence of characters from the
373	      Universal Character Set [ISO10646].  A mapping from IRIs to URIs
374	      is defined, which means that IRIs can be used instead of URIs
375	      where appropriate to identify resources."

377	   The IRI specification applies the "replace", "map" and "subsume"
378	   strategies for expansion outlined above.  As noted in the quoted text
379	   from the IRI document, IRIs are defined as a new protocol element
380	   ("replace").  Therefore, any protocol or message format defined in
381	   the future may use an IRI protocol element and not a URI protocol
382	   element.  However, URIs are ubiquitous, and IRIs would face steep
383	   deployment challenges without the possibility of relating to URIs.
384	   Therefore, [2] defines a mapping strategy to ensure IRIs can be
385	   mapped onto URIs and vice versa.
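
   The general shape of such a mapping can be sketched in Python.
   This is a simplification and not the algorithm of [2]: it encodes
   each non-ASCII character as UTF-8 and hex-encodes the resulting
   octets, and it omits details such as host name handling and
   normalization.  The function names are hypothetical.

```python
import re

# Runs of hex-encoded octets that have the high bit set.
_ESCAPED_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def iri_to_uri(iri: str) -> str:
    """Hex-encode the UTF-8 octets of every non-ASCII character."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # US-ASCII characters pass through
        else:
            out.extend(
                "%{:02X}".format(octet) for octet in ch.encode("utf-8")
            )
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Decode hex-encoded runs that form valid UTF-8; keep the rest."""
    def decode(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not UTF-8, leave untouched
    return _ESCAPED_RUN.sub(decode, uri)


print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
# http://example.org/r%C3%A9sum%C3%A9
```

   The reverse direction deliberately leaves hex-encoded US-ASCII
   octets such as "%2F" alone, since decoding them could change how
   the identifier is parsed.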

387	   The IRI document also goes on to note that there are specifications
388	   already designated to handle IRIs -- "anyURI" in XML Schema.  This is
389	   an example of subsumption.

391	   While the IRI document is clear that conversions between IRI and URI
392	   formats must be made when transitioning from systems that understand
393	   IRIs to ones that do not, it is unclear how message parsers that
394	   detect and interpret "http://" as a URI will recognize IRIs as
395	   distinct from (malformed) URIs.

397	7. Security Considerations

399	   Any protocol processing which depends on a specific set of tokens or
400	   structure is at risk when the matching and sorting rules for the set
401	   are indeterminate.  In some cases, this can result in a denial of
402	   service, as legitimate tokens are not recognized; in other cases,
403	   inappropriate access may be granted by matching incorrectly.

405	8. IANA Considerations

407	   There are no IANA considerations defined in this memo.

409	9. Acknowledgements

411	   The authors would like to thank Martin Duerst for his attention and
412	   expertise.

414	References

416	   [1]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
417	        Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

419	   [2]  Duerst, M. and M. Suignard, "Internationalized Resource
420	        Identifiers (IRIs)", draft-duerst-iri-05.txt (work in progress),
421	        October 2003.

423	Authors' Addresses

425	   Leslie Daigle
426	   Editor

428	   Ted Hardie
429	   Editor

431	   Internet Architecture Board
432	   IAB

434	   EMail: iab@iab.org

436	Full Copyright Statement

438	   Copyright (C) The Internet Society (2004).  All Rights Reserved.

440	   This document and translations of it may be copied and furnished to
441	   others, and derivative works that comment on or otherwise explain it
442	   or assist in its implementation may be prepared, copied, published
443	   and distributed, in whole or in part, without restriction of any
444	   kind, provided that the above copyright notice and this paragraph are
445	   included on all such copies and derivative works.  However, this
446	   document itself may not be modified in any way, such as by removing
447	   the copyright notice or references to the Internet Society or other
448	   Internet organizations, except as needed for the purpose of
449	   developing Internet standards in which case the procedures for
450	   copyrights defined in the Internet Standards process must be
451	   followed, or as required to translate it into languages other than
452	   English.

454	   The limited permissions granted above are perpetual and will not be
455	   revoked by the Internet Society or its successors or assigns.

457	   This document and the information contained herein is provided on an
458	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
459	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
460	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
461	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
462	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

464	Acknowledgement

466	   Funding for the RFC Editor function is currently provided by the
467	   Internet Society.