Network Working Group                                          L. Daigle
Internet-Draft                                                 T. Hardie
Expires: August 16, 2004                                          Editor
                                             Internet Architecture Board
                                                                     IAB
                                                       February 16, 2004


    Considerations on Increasing Character Repertoires for Protocol
                           Actionable Elements
                          draft-iab-char-rep-01

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 16, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2004).
   All Rights Reserved.

Abstract

   This document describes a set of considerations and strategies to use
   in increasing the character repertoire available in a protocol
   actionable element or suite of protocol actionable elements. This
   document is not meant to provide normative instruction to protocol
   designers, but does hope to provide guidance on common issues arising
   from this task. Feedback should be sent to the editors or the IAB.

Table of Contents

   1.   Definitions
   2.   Introduction
   3.   Avoidance mechanisms
   3.1  Choosing a large initial character repertoire
   3.2  Choosing opaque protocol tokens
   3.3  Expansion mechanisms
   3.4  Replace
   3.5  Subsume
   3.6  Map
   4.   Layering a presentation element on a new protocol element
   5.   Selecting a strategy
   6.   Case Studies
   6.1  Uniform and Internationalized Resource Identifiers
   7.   Security Considerations
   8.   IANA Considerations
   9.   Acknowledgements
        References
        Authors' Addresses
        Full Copyright Statement

1. Definitions

   Protocol (actionable) element: A protocol actionable element, or
   protocol element, is any portion of a message which affects
   processing of that message by the protocol in question. In general,
   protocol elements are bound to specific processing choices by
   membership in a set of predetermined tokens or by explicit structure.
   Protocol elements are context dependent in that the processing for a
   token is specific to a protocol. To IP, for example, a TCP port
   number is payload; to TCP it is a protocol element. Similarly, to
   TCP a Content-encoding: header is payload; to HTTP, it is a protocol
   element.

   Character repertoire: A character repertoire is the set of all
   characters in all permitted encodings which may be used in a protocol
   element. Each element in a character repertoire is a tuple of a code
   point and an encoding. Thus the glyph "a" would appear three times
   in a character repertoire that permitted ASCII, iso-8859-1, and
   iso-8859-7.

   Character set: As it says, "a set of characters", but more
   particularly a set of characters as represented by code points in a
   particular encoding.

2. Introduction

   After a protocol's initial deployment, changes in the use of the
   protocol sometimes necessitate revisiting the character repertoire
   originally chosen for one or more of the elements which make up the
   protocol. On rare occasions, this occurs because the protocol
   designers need to increase the number of tokens available in a
   fixed-length field and choose to do so by increasing the number of
   characters which may be used. More commonly, the motive for the
   increase of a character repertoire is the exposure of a protocol
   element to a user community. Once this leakage occurs, there is
   often pressure to expand the permitted character repertoire of the
   protocol element to match the character repertoire in use in that
   community.
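
   Under the definition in Section 1, expanding a repertoire amounts to
   adding (code point, encoding) tuples to a set. The following Python
   sketch is illustrative only: the helper name repertoire_entries and
   the ENCODINGS tuple are hypothetical, and "code point" is taken here
   to mean the single-byte value a character has in a given encoding.

```python
# Illustrative sketch; names are hypothetical, not part of any protocol.
ENCODINGS = ("ascii", "iso-8859-1", "iso-8859-7")


def repertoire_entries(char, encodings):
    """Return the (code point, encoding) tuples under which `char` appears.

    Assumes single-byte encodings, so the code point is the one byte
    produced by encoding the character.
    """
    entries = set()
    for enc in encodings:
        try:
            encoded = char.encode(enc)
        except UnicodeEncodeError:
            # The character does not exist in this encoding.
            continue
        entries.add((encoded[0], enc))
    return entries


# The glyph "a" appears once per permitted encoding, each time as 0x61.
a_entries = repertoire_entries("a", ENCODINGS)

# The same byte value can denote different characters in different
# encodings, which is why the encoding is part of each tuple.
latin = bytes([0xE1]).decode("iso-8859-1")  # LATIN SMALL LETTER A WITH ACUTE
greek = bytes([0xE1]).decode("iso-8859-7")  # GREEK SMALL LETTER ALPHA
```

   In this sketch, a_entries holds three tuples for "a", while the byte
   0xE1 decodes to two different characters depending on the encoding
   assumed.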

   Though increasing a character repertoire may appear to be a
   relatively simple matter, there are a number of protocol processing
   functions which may be affected. First among these is matching.
   Many encodings have very specific matching rules or equivalence
   tables; increasing a character repertoire to include a new encoding
   implies that the protocol must specify how matching works in that
   encoding. Like matching, sorting works in different ways in
   different encoding schemes, and including a new encoding means
   specifying sorting algorithms for use with it. Transformation
   presents some unique issues, as it may be possible for some systems
   to map only unidirectionally from one encoding to another. Any of
   these, and more, can present problems to a protocol designer who must
   retrofit an increased character repertoire into a deployed protocol
   after the fact.

3. Avoidance mechanisms

   To avoid the need to increase character repertoires at some later
   date, protocol designers can either start with a character repertoire
   which is large enough to encompass that in use in the target user
   community or use protocol elements that are sufficiently opaque to a
   human user that their leakage is unlikely to present later pressure.
   Both strategies, unfortunately, have been notoriously difficult to
   get right.

3.1 Choosing a large initial character repertoire

   In this avoidance strategy, the protocol designers presume that their
   protocol elements will leak in the future and provide a character
   repertoire which is sufficiently rich to match the user community's
   needs. Increasing use of a protocol, however, often changes the
   target user community beyond the initial designers' projections. A
   character repertoire which looks large to one user community may be
   completely wrong or very limited to another.
   When protocol designers attempt to avoid the issue by using a
   character repertoire with a very large number of code points in a
   very large number of encodings, they incur real costs in parser
   complexity, processing overhead, and bloat. They also risk that
   misconfiguration of these complex parsers will result in incorrect
   protocol processing.

3.2 Choosing opaque protocol tokens

   In the second case, designers who choose to use tokens or structure
   which are not human-readable can resist later pressure to increase
   the character repertoire available. As those who have used encodings
   like ASN.1 can attest, there is, however, an increased development
   cost, as those working with the protocol must develop an
   understanding of the use of the tokens or structure without the aid
   of readability. This avenue may also be blocked or narrowed to
   protocol designers who will need to pass the new elements among
   different protocols; in those cases, the new protocol is either
   constrained by the previous choices or must provide a normative
   mapping to them.

   When designers use tokens or structures which are not human readable,
   it is common to create a presentation format or layer which is mapped
   to the tokens or structures. One of the advantages of this approach
   is that new mappings can be defined as new user communities express
   the need for them. It is important, however, that these are always
   retained as mappings to the protocol elements, and are not treated as
   protocol elements themselves.

3.3 Expansion mechanisms

   For designers who must increase the character repertoire for a
   particular protocol element, there are three basic strategies
   available: they may replace the existing protocol element with a new
   one; they may subsume the character repertoire of the existing
   protocol element in a new one; they may map the new character
   repertoire into the existing repertoire.

   For each of the strategies below, consider the following example:
   a protocol element called "POSTAL", used to name the U.S. ZIP code in
   which the network element is placed, cannot handle postal codes
   containing characters outside (0,1,2,3,4,5,6,7,8,9) encoded in a
   subset of US-ASCII. We will refer to this character set as
   (NUM-ASCII). The original character repertoire for this protocol
   element has NUM-ASCII as its single member character set.

3.4 Replace

   Replacing an existing protocol element with an entirely new protocol
   element with a different character repertoire is by far the cleanest
   solution from a design perspective. A new protocol element may have
   its own matching and sorting rules, without regard to any previous
   deployment. This means that the new element will have as little
   baggage as is possible when updating parsers and setting forth how it
   fits into the protocol's semantics.

   Unfortunately, this method presents a raft of deployment problems.
   Since existing protocol implementations will know nothing about it,
   they cannot be interoperable with any entirely new protocol element.
   At best, they can ignore it gracefully; at worst, they will fail. A
   protocol designer can react to this by changing the revision number
   on a protocol, by using some form of feature negotiation, or by using
   heuristics (including failure!) to determine whether or not a new
   protocol element may be used. All of these are difficult to get
   right, especially in hop-by-hop protocols, in which it may not be
   possible to determine whether all hops support specific features or
   versions.

   A protocol designer tackling this problem for the protocol element
   naming the postal code in which a network element is placed might
   replace "POSTAL" with "NEW_POSTAL" and create a new character
   repertoire for "NEW_POSTAL" which contained the single entry
   (ISO-8859-1).
   [This is merely an example; the choice of which character set or
   sets to use would be made in this instance by reference to the
   relevant international postal standards.] Obviously, any system which
   did not understand "NEW_POSTAL" would need to be upgraded to handle
   the new character set. Depending on the transition mechanism,
   systems communicating postal codes which were numeric-only might well
   include both "POSTAL" and "NEW_POSTAL" protocol elements.

3.5 Subsume

   Rather than completely replacing an existing protocol element,
   another strategy is to create a protocol element which subsumes the
   character repertoire of the existing protocol element. When this
   option is chosen, the new protocol element retains all the character
   sets and the related matching and sorting rules which were originally
   present. These become a strict subset of the new character
   repertoire.

   This strategy limits the functionality of the new protocol element
   both by forcing it to include specific character sets and by
   requiring that the semantics of the new protocol element exactly
   match the existing protocol element. This strategy also retains many
   of the deployment problems of the replacement strategy, though it
   offers some opportunities to mitigate the issues. As with the
   replacement strategy, there may need to be negotiation mechanisms
   capable of handling both protocol elements, though new
   implementations can sometimes treat the old protocol element as a
   degenerate case of the new protocol element.

   If our "POSTAL" protocol design team took this strategy, they might
   replace the (NUM-ASCII) character repertoire of "POSTAL" with a new
   protocol element "BIG_POSTAL" for which the character repertoire is
   (NUM-ASCII, US-ASCII). Because NUM-ASCII is a strict subset of
   US-ASCII, the protocol can treat all "POSTAL" protocol elements as if
   they were "BIG_POSTAL" protocol elements.
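
   The subset relationship between "POSTAL" and "BIG_POSTAL" can be
   sketched in Python as follows. This is a minimal illustration under
   stated assumptions, not a definition of either element: the function
   names are hypothetical, and US-ASCII is modelled simply as the 128
   characters with values 0 through 127.

```python
import string

# Assumption: NUM-ASCII is the ten ASCII digits; US-ASCII is values 0..127.
NUM_ASCII = frozenset(string.digits)
US_ASCII = frozenset(chr(value) for value in range(128))


def is_valid_postal(value):
    """True if `value` is a non-empty string drawn only from NUM-ASCII."""
    return len(value) > 0 and all(ch in NUM_ASCII for ch in value)


def is_valid_big_postal(value):
    """True if `value` is a non-empty string drawn only from US-ASCII."""
    return len(value) > 0 and all(ch in US_ASCII for ch in value)


def as_big_postal(postal_value):
    """Treat a legacy POSTAL value as a BIG_POSTAL value.

    No transformation is needed because every NUM-ASCII character is
    also a US-ASCII character; the value is only validated.
    """
    if not is_valid_postal(postal_value):
        raise ValueError("not a valid POSTAL value: %r" % (postal_value,))
    return postal_value
```

   A new implementation written this way needs no marker to tell the two
   elements apart: any value accepted by is_valid_postal is also accepted
   by is_valid_big_postal, while the reverse does not hold.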
   Note that this is the simplest possible example of this particular
   strategy, as there is no need to mark which character set from the
   character repertoire is in use. More complex examples may require
   much more complex processing to achieve the same results.

3.6 Map

   In some instances it may be possible and desirable to map an expanded
   character repertoire onto the existing code points specified by a
   protocol. In this case, the code points are themselves retained but
   the character encoding portion of the tuple is changed to create an
   expanded character repertoire. This strategy can only work when some
   marker is used to indicate which character encoding applies to a
   specific instance of the protocol. This marker must be something
   which is non-operative in the original protocol processing, or the
   strategy will incur the negotiation costs mentioned above. This
   strategy will tend to increase the size of protocol elements unless
   the original code points were radically under-used. It also carries
   the near-certainty that there will be occasions in which protocol
   elements encoded with the new character encoding are mis-identified
   as being encoded with the original character encoding.

   This strategy has somewhat unique deployment consequences, in that it
   is both easier to get initial deployment and harder to get complete
   penetration. Because the same code points are used throughout, there
   is no requirement that all systems upgrade for the increased
   character repertoire to be available to a subset of users. There is
   also, however, almost no incentive for upgrade of systems which do
   not themselves require the increased repertoire.
   This is particularly true in hop-by-hop and commonly proxied
   protocols, because the on-path intermediate systems will pass the
   elements of the expanded repertoire by virtue of their being
   legitimate code points in the original repertoire; they do not need
   to upgrade and they probably never will.

   For our protocol design team to tackle "POSTAL" using this strategy
   they must develop or discover an encoding which allows them to
   represent all the needed characters using just (NUM-ASCII). If, for
   example, the character repertoire needed to add a character set which
   included (A-Z), but no others, the team could use US-ASCII's
   three-digit decimal code for each included character. A postal code
   like "KLHSW1" would then be encoded as "075076072083087049".
   Provided that the original POSTAL protocol element had a field length
   sufficient to handle the new encoding, it could carry the new values
   without any difficulty. The difficulty would be determining whether
   the new encoding or the old should be assumed; in this limited case,
   length alone could be made a marker by padding any short alphabetic
   postal codes with the ASCII null character, "000", until they reached
   a length sufficient to trigger treatment as non-ZIP code postal
   codes. In other cases more complex triggers would be required.

4. Layering a presentation element on a new protocol element

   It is noted above that designers using non-human readable tokens may
   provide a mapping to a presentation element which can be used by
   humans working with the protocol. In employing any of the strategies
   above, it is useful for protocol designers to consider introducing a
   presentation element at the same time.
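
   For the "POSTAL" mapping example of Section 3.6, such a presentation
   element could be sketched in Python as follows. This is an
   illustration under stated assumptions rather than a specified
   encoding: the function names are hypothetical, and LEGACY_ZIP_LENGTH
   assumes that legacy "POSTAL" values are five-digit ZIP codes, so that
   any longer value is treated as using the mapped encoding.

```python
# Assumption: legacy POSTAL values are 5-digit ZIP codes, so length
# alone can act as the marker for the mapped encoding.
LEGACY_ZIP_LENGTH = 5
DIGITS = "0123456789"


def encode_mapped(presentation):
    """Map a US-ASCII postal code onto NUM-ASCII, three digits per character.

    Short results are padded with "000" (ASCII NUL) until they are longer
    than a legacy ZIP code and so cannot be mistaken for one.
    """
    if not presentation or any(ord(ch) > 127 for ch in presentation):
        raise ValueError("presentation form must be non-empty US-ASCII")
    wire = "".join("%03d" % ord(ch) for ch in presentation)
    while len(wire) <= LEGACY_ZIP_LENGTH:
        wire += "000"
    return wire


def decode_mapped(wire):
    """Recover the presentation form from the mapped wire form."""
    if len(wire) % 3 != 0 or any(ch not in DIGITS for ch in wire):
        raise ValueError("wire form must be groups of three decimal digits")
    values = [int(wire[i:i + 3]) for i in range(0, len(wire), 3)]
    if any(value > 127 for value in values):
        raise ValueError("group outside the US-ASCII range")
    # Drop the NUL padding added by encode_mapped.
    return "".join(chr(value) for value in values if value != 0)


def to_presentation(wire):
    """Present a POSTAL value to a user, using length as the marker."""
    if len(wire) <= LEGACY_ZIP_LENGTH:
        return wire  # legacy numeric ZIP code, shown as-is
    return decode_mapped(wire)
```

   With these assumptions, encode_mapped("KLHSW1") yields
   "075076072083087049" and to_presentation reverses it, while a value
   such as "94043" is presented unchanged. A longer legacy value (for
   example a nine-digit ZIP code) would be mis-identified by this length
   test, which is the kind of ambiguity Section 3.6 warns about.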
   A presentation element is almost a required part of the mapping
   strategy, as using an encoding based on the original set of code
   points does not help the user community unless it can also be mapped
   to an encoding in common use for presentation. It may be used with
   any of the strategies, though, and given the potential for the
   introduction of new character encodings, it must be considered
   carefully as a method of ensuring that the same problem does not face
   the protocol in a few years' time.

5. Selecting a strategy

   The first step in selecting a strategy is identifying the protocol
   processing choices which depend on the protocol element. If a
   protocol element is passed among different protocols, this set of
   choices must be identified for each of the protocols which depend on
   the element. After those have been identified, the available methods
   for passing the protocol elements from one protocol to another must
   be considered.

   If at all possible, a single strategy should be selected for use with
   a specific protocol element, even when that protocol element will be
   passed among different protocols. Since protocol processing is
   context-specific, it is technically possible to use different methods
   in different contexts, but this increase in complexity rarely has a
   corresponding gain.

   Whether the protocol element will be used in one protocol or
   several, the core question to consider is how best to maintain
   interoperability while increasing the character repertoire. For
   example, if creating a new protocol element as a fully fledged
   replacement, are there available mechanisms to handle the negotiation
   and/or versioning? Alternatively, are there methods which would
   allow both protocol elements to coexist?

   The second question to consider is the cost of implementation.
   If, for example, a choice is made to introduce a protocol element
   which subsumes the original character repertoire in a larger
   character repertoire, how expensive will the increase in parsing
   complexity be?

   The third question to consider is likely deployment patterns. For a
   client/server protocol, will it be feasible to update both client and
   server? For a hop-by-hop protocol, will there be any pressure for
   intermediate servers to upgrade?

   A related question is whether this change will be tied to other
   changes which will drive adoption, or whether this change will be
   unrelated to other updates to the protocol.

6. Case Studies

6.1 Uniform and Internationalized Resource Identifiers

   Uniform Resource Identifiers ([1]) make use of the 7-bit US-ASCII
   character repertoire. The syntax of the URI permits other encodings
   to be mapped into that repertoire, by defining a hex-encoding
   framework.

   Increasingly, new URI schemes are using UTF-8 for characters
   beyond US-ASCII. In recognition of this, and to provide a means to
   handle such identifiers in a more straightforward manner, the
   "Internationalized Resource Identifier" (IRI) has been introduced.

   From [2]:

      "This document defines a new protocol element, the
      Internationalized Resource Identifier (IRI), as a complement to
      the URI [RFCYYYY]. An IRI is a sequence of characters from the
      Universal Character Set [ISO10646]. A mapping from IRIs to URIs
      is defined, which means that IRIs can be used instead of URIs
      where appropriate to identify resources."

   The IRI specification applies the "replace", "map" and "subsume"
   strategies for expansion outlined above. As noted in the quoted text
   from the IRI document, IRIs are defined as a new protocol element
   ("replace"). Therefore, any protocol or message format defined in
   the future may use an IRI protocol element and not a URI protocol
   element.
   However, URIs are ubiquitous, and IRIs would face steep deployment
   challenges without the possibility of relating to URIs.
   Therefore, [2] defines a mapping strategy to ensure IRIs can be
   mapped onto URIs and vice versa.

   The IRI document also goes on to note that there are specifications
   already designated to handle IRIs -- "anyURI" in XML Schema. This is
   an example of subsumption.

   While the IRI document is clear that conversions between IRI and URI
   formats must be made when transitioning from systems that understand
   IRIs to ones that do not, it is unclear how message parsers that
   detect and interpret "http://" as a URI will recognize IRIs as
   distinct from (malformed) URIs.

7. Security Considerations

   Any protocol processing which depends on a specific set of tokens or
   structure is at risk when the matching and sorting rules for the set
   are indeterminate. In some cases, this can result in a denial of
   service, as legitimate tokens are not recognized; in other cases,
   inappropriate access may be granted by matching incorrectly.

8. IANA Considerations

   There are no IANA considerations defined in this memo.

9. Acknowledgements

   The authors would like to thank Martin Duerst for his attention and
   expertise.

References

   [1]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
        Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

   [2]  Duerst, M. and M. Suignard, "Internationalized Resource
        Identifiers (IRIs)", draft-duerst-iri-05.txt (work in progress),
        October 2003.

Authors' Addresses

   Leslie Daigle
   Editor

   Ted Hardie
   Editor

   Internet Architecture Board
   IAB

   EMail: iab@iab.org

Full Copyright Statement

   Copyright (C) The Internet Society (2004). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.