Network Working Group                                          D. Taylor
Internet-Draft                                               Independent
Updates: 4122 (if approved)                                 26 July 2020
Intended status: Informational
Expires: 27 January 2021


          Compact, Grammar-Friendly Representations for UUIDs
                      draft-taylor-uuid-ncname-00

Abstract

   The Universally Unique Identifier is a suitable standard for, as the
   name suggests, uniquely identifying entities in a symbol space large
   enough that the identifiers do not collide. The literal
   representation, however, specified in RFC 4122 and elsewhere, cannot
   be used in conjunction with a number of formal grammars where it
   would be beneficial to do so. This document provides the UUID with
   two additional representations to make these applications possible.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 27 January 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Motivation & Applications
   2.  Terminology
   3.  Strategy
   4.  Syntax
     4.1.  Recognizing UUID-NCName Symbols
     4.2.  Equivalency
   5.  Algorithms
     5.1.  Encoding Algorithm
     5.2.  Decoding Algorithm
   6.  IANA Considerations
   7.  Security Considerations
   8.  Normative References
   9.  Informative References
   Appendix A.  Samples
   Appendix B.  Implementations
   Author's Address

1.  Introduction

   There are a number of places in formal languages where it would be
   useful to put UUIDs, but the grammar forbids it. Many grammars
   forbid identifiers to begin with numbers, or contain hyphens, or
   contain colons (as with the URN representation in RFC 4122
   [RFC4122]). The NCName production [XML-NAMES], which is pervasive in
   XML and RDF applications, is one such example. Up until a recent
   change, the HTML ID production had similar constraints. Virtually
   every programming language likewise requires identifiers such as
   variables and function names to start with a letter or underscore,
   and very few admit hyphens. This constraint causes developers to
   turn to ad-hoc solutions when they want to use UUIDs in these places.

   This document specifies a representation - or rather, two
   representations - as well as the related transformations to and from
   the familiar UUID format. A provisional name for these
   representations is _UUID-NCName_, with the two variants styled as
   _UUID-NCName-32_ and _UUID-NCName-64_, referring to the base of their
   respective encodings. The goal of this specification is in part to
   eliminate an extra decision on the part of developers who find
   themselves in this position, and in part to provide alternative
   representations for UUIDs which remain valid but are shorter than the
   original.

1.1.  Motivation & Applications

   The purpose of an identifier in general is to pick out some
   information resource or other, such that it can be referred to,
   ideally unambiguously. The purpose of a large, generated identifier
   like the UUID is to satisfy the uniqueness criterion while also
   specifying a datatype and normal form for said identifiers, and
   ultimately to alleviate the need to sit down and think these
   identifiers up. The reason to insert UUIDs in places they would not
   otherwise fit is so that these UUIDs can be cross-referenced in some
   other database where they _do_ fit. Consider:

   *  A component content management system that uses UUIDs to identify
      elementary content components uses the UUID-NCName-64
      representations of the same UUIDs as fragment identifiers for when
      those components are transcluded.

   *  A literate programming system uses the UUID-NCName-32
      representation as stable identifiers for all symbols (variables,
      constants, class names, etc.), enabling said identifiers to be
      defined and described elsewhere, while still yielding
      syntactically-correct code.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Strategy

   Not all 128 bits of a UUID are data; rather, several bits are masked.
   The top four bits of the third segment, known as
   "time_hi_and_version", specify the UUID's version, which is fixed.
   Up to three high bits in the following segment, called
   "clock_seq_hi_and_reserved", specify the variant: how the UUID - if
   applicable - is meant to be read. We remove these masked quartets
   (we take an extra bit for the variant) and use them as "bookends" for
   the rest of the identifier, mapping them to the first sixteen symbols
   of the Base32 table [RFC4648], which are all letters. The remaining
   120 bits, which we bit-shift to close the gaps of the two masked
   quartets we removed, now divide evenly by both 5 and 6, the number of
   bits per character in Base32 and Base64, respectively.

   The transformation takes the UUID
   "4abc6330-f548-4e67-b9f9-12d4323769cd", and returns the result
   "ESrxjMPVI5nn5EtQyN2nNL" for base64, and "ejk6ggmhvjdtht6is2qzdo2onl"
   for base32. These symbols will always start and end with
   case-insensitive letters, and the entire base32 symbol is
   case-insensitive.

4.  Syntax

   Here is the ABNF grammar for the productions "uuid-ncname-32" and
   "uuid-ncname-64":

   uuid-ncname-32 = bookend 24base32 bookend
   uuid-ncname-64 = bookend 20base64url bookend
   bookend        = %x41-50 / %x61-70           ; [A-Pa-p]
   base32         = %x32-37 / %x41-5a / %x61-7a ; [2-7A-Za-z]
   base64url      = %x2d / %x30-39 / %x41-5a / %x5f / %x61-7a
                                                ; [-0-9A-Z_a-z]

   "Bookends" are 4-bit sequences (nybbles, quartets, etc.) which we map
   directly onto the Base32 table from [RFC4648].
   Indeed, this portion of the Base64 table is identical, though we say
   Base32 to underscore the fact that bookend characters are
   case-insensitive. Certain environments encode meaning into the case
   of the first character of a symbol, so it is important that its
   literal representation be flexible. There is likewise little value
   in arbitrarily constraining the last character. Nevertheless,
   UUID-NCName-64 symbols SHOULD be generated with upper-case bookend
   characters, while UUID-NCName-32 bookends (and indeed the entire
   symbol) SHOULD be lower-case.

4.1.  Recognizing UUID-NCName Symbols

   UUID-NCName symbols always have a fixed length and certain
   characteristics: UUID-NCName-32 symbols are always exactly 26
   characters long while UUID-NCName-64 symbols are always 22 characters
   long. The version (first bookend character) is mapped to the Base32
   table where "A" is 0, so "B" is 1, etc. Random (version 4) UUIDs
   will therefore always start with the letter "E". Any value higher
   than "F" (version 5/truncated SHA-1 UUID) is unspecified (though
   there is room for future UUID specifications to go all the way up to
   version 15). Likewise the variant bit-mask defined in [RFC4122] will
   cause the symbol to always end, modulo upper/lower-case, in "I", "J",
   "K", or "L" (8, 9, 10, 11).

4.2.  Equivalency

   Two UUID-NCName symbols are necessarily identical if they produce the
   same UUID. Two UUID-NCName-32 symbols are identical if their string
   values match when normalized to all upper- or lower-case letters.
   Two UUID-NCName-64 symbols are identical if their string values match
   when the bookend characters are normalized to either upper- or
   lower-case.

5.  Algorithms

   These are candidate algorithms for encoding and decoding the symbols,
   transforming them to and from the conventional UUID representation.
   There are certainly many equivalents.

5.1.  Encoding Algorithm

   First we apply the shifting algorithm:

   1.  Convert the UUID to a binary string "bin".

   2.  Convert "bin" to an array of four 32-bit unsigned network-endian
       integers "ints".

   3.  Extract "version" as "(ints[1] & 0x0000f000) >> 12".

   4.  Extract "variant" as "(ints[2] & 0xf0000000) >> 24".

   5.  Assign "ints[1] = (ints[1] & 0xffff0000) | ((ints[1] &
       0x00000fff) << 4) | ((ints[2] & 0x0fffffff) >> 24)".

   6.  Assign "ints[2] = (ints[2] & 0x00ffffff) << 8 | (ints[3] >> 24)".

   7.  Assign "ints[3] = (ints[3] << 8) | variant".

   8.  Convert "ints" back into a binary string and return it along with
       the "version".

   Then we apply one of the formatting algorithms. Here is Base64:

   1.  Take the binary string "bin" and shift the last octet to the
       right by two bits.

   2.  Encode "bin" with the base64url algorithm to get the string
       "b64".

   3.  Truncate "b64" to 21 characters.

   4.  Convert "version" to its value in the Base32 table.

   5.  Return "version" concatenated to "b64".

   And Base32:

   1.  Take the binary string "bin" and shift the last octet to the
       right by one bit.

   2.  Encode "bin" with the base32 algorithm to get the string "b32".

   3.  Truncate "b32" to 25 characters.

   4.  Convert "version" to its value in the Base32 table.

   5.  Return "version" concatenated to "b32", optionally in either
       upper or lower case.

5.2.  Decoding Algorithm

   1.  First verify the syntax and determine whether the symbol "ncname"
       is base32 or base64.

   2.  If "ncname" is base64 and the last character is lowercase, set it
       to uppercase.

   3.  Remove the first character of the symbol "ncname" and convert it
       into an integer according to the base32 spec; call that integer
       "version".

   4.  Append padding if necessary to satisfy the decoder, "A======" for
       Base32 and "A==" for Base64.

   5.  Decode the remainder of "ncname" by either the base32 or
       base64url decoding algorithm into binary string "bin".

   6.  If "ncname" was base32, shift the last octet of "bin" one bit to
       the left; if base64, shift it two bits.

   Now we apply the shifting algorithm in reverse:

   1.   Ensure "version" is in the range of 0-15 by masking it with
        "0xf".

   2.   Convert the binary string "bin" into four 32-bit unsigned
        network-endian integers "ints".

   3.   Assign "variant = (ints[3] & 0xf0) << 24".

   4.   Shift and assign "ints[3] >>= 8".

   5.   Union and assign "ints[3] |= ((ints[2] & 0xff) << 24)".

   6.   Shift and assign "ints[2] >>= 8".

   7.   Union and assign "ints[2] |= ((ints[1] & 0xf) << 24) | variant".

   8.   Assign "ints[1] = (ints[1] & 0xffff0000) | (version << 12) |
        ((ints[1] >> 4) & 0xfff)".

   9.   Convert "ints" back into the new binary string "bin".

   10.  Format "bin" as a UUID.

6.  IANA Considerations

   There are no discernible IANA considerations associated with this
   specification.

7.  Security Considerations

   As UUID-NCName symbols are isomorphic to their conventional UUID
   representations, the security considerations for these symbols are
   the same as those in [RFC4122], though we repeat here the admonition
   not to assume that UUIDs are hard to guess.

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4122]  Leach, P., Mealling, M., and R. Salz, "A Universally
              Unique IDentifier (UUID) URN Namespace", RFC 4122,
              DOI 10.17487/RFC4122, July 2005.

   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.  Informative References

   [XML-NAMES]
              Bray, T., Hollander, D., Layman, A., Tobin, R., and H S.
              Thompson, "Namespaces in XML 1.0 (Third Edition)", 8
              December 2009.

Appendix A.  Samples

   +===================+======================================+
   | Version           | Canonical UUID Representation        |
   +===================+======================================+
   | 0, Nil            | 00000000-0000-0000-0000-000000000000 |
   +===================+--------------------------------------+
   | 1, Timestamp      | ca6be4c8-cbaf-11ea-b2ab-00045a86c8a1 |
   +===================+--------------------------------------+
   | 2, DCE "Security" | 000003e8-cbb9-21ea-b201-00045a86c8a1 |
   +===================+--------------------------------------+
   | 3, MD5            | 3d813cbb-47fb-32ba-91df-831e1593ac29 |
   +===================+--------------------------------------+
   | 4, Random         | 01867b2c-a0dd-459c-98d7-89e545538d6c |
   +===================+--------------------------------------+
   | 5, SHA-1          | 21f7f8de-8051-5b89-8680-0195ef798b6a |
   +===================+--------------------------------------+

          Table 1: Samples of canonical UUID representations

   +============+============================+========================+
   | Version    | Base32                     | Base64                 |
   +============+============================+========================+
   | 0, Nil     | aaaaaaaaaaaaaaaaaaaaaaaaaa | AAAAAAAAAAAAAAAAAAAAAA |
   +============+----------------------------+------------------------+
   | 1,         | bzjv6jsglv4pkfkyaarninsfbl | BymvkyMuvHqKrAARahsihL |
   | Timestamp  |                            |                        |
   +============+----------------------------+------------------------+
   | 2, DCE     | caaaah2glxepkeaiaarninsfbl | CAAAD6Mu5HqIBAARahsihL |
   | "Security" |                            |                        |
   +============+----------------------------+------------------------+
   | 3, MD5     | dhwatzo2h7mv2dx4ddykzhlbjj | DPYE8u0f7K6Hfgx4Vk6wpJ |
   +============+----------------------------+------------------------+
   | 4, Random  | eagdhwlfa3vm4rv4j4vcvhdlmj | EAYZ7LKDdWcjXieVFU41sJ |
   +============+----------------------------+------------------------+
   | 5, SHA-1   | feh37rxuakg4jnaabsxxxtc3ki | FIff43oBRuJaAAZXveYtqI |
   +============+----------------------------+------------------------+

            Table 2: Samples of UUID-NCName representations

Appendix B.  Implementations

   As of this writing, there are two implementations of UUID-NCName:

   *  Perl, https://metacpan.org/pod/Data::UUID::NCName

   *  Ruby, https://rubygems.org/gems/uuid-ncname

Author's Address

   Dorian Taylor
   Independent

   Email: ietf@doriantaylor.com
   URI:   https://doriantaylor.com/
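
   As an illustration of the candidate algorithms in Section 5, the
   steps can be sketched in Python. This is a minimal sketch and not a
   reference implementation: the names "BOOKENDS", "_shift",
   "to_ncname_64", "to_ncname_32", and "from_ncname" are hypothetical
   and chosen only for this example. It relies on the standard "base64"
   and "uuid" modules, distinguishes the two forms by length alone, and
   performs only minimal syntax checking.

```python
import base64
import uuid

# First sixteen symbols of the Base32 table; hypothetical helper name.
BOOKENDS = "ABCDEFGHIJKLMNOP"


def _shift(u):
    """Shifting step of Section 5.1: return (version, 16 shifted octets)."""
    ints = [int.from_bytes(u.bytes[i:i + 4], "big") for i in range(0, 16, 4)]
    version = (ints[1] & 0x0000F000) >> 12
    variant = (ints[2] & 0xF0000000) >> 24  # nybble sits in the high half
    ints[1] = ((ints[1] & 0xFFFF0000) | ((ints[1] & 0x00000FFF) << 4)
               | ((ints[2] & 0x0FFFFFFF) >> 24))
    ints[2] = ((ints[2] & 0x00FFFFFF) << 8) | (ints[3] >> 24)
    ints[3] = ((ints[3] << 8) & 0xFFFFFFFF) | variant  # keep to 32 bits
    return version, b"".join(i.to_bytes(4, "big") for i in ints)


def to_ncname_64(u):
    """Encode a uuid.UUID as a 22-character UUID-NCName-64 symbol."""
    version, data = _shift(u)
    data = data[:15] + bytes([data[15] >> 2])
    b64 = base64.urlsafe_b64encode(data).decode("ascii")[:21]
    return BOOKENDS[version] + b64


def to_ncname_32(u):
    """Encode a uuid.UUID as a 26-character, lower-case UUID-NCName-32."""
    version, data = _shift(u)
    data = data[:15] + bytes([data[15] >> 1])
    b32 = base64.b32encode(data).decode("ascii")[:25]
    return (BOOKENDS[version] + b32).lower()


def from_ncname(ncname):
    """Decode either form back to a uuid.UUID (Section 5.2)."""
    if len(ncname) == 26:    # base32: the whole symbol is case-insensitive
        body = ncname[1:].upper() + "A======"
        data = bytearray(base64.b32decode(body))
        data[15] = (data[15] << 1) & 0xFF
    elif len(ncname) == 22:  # base64: only the bookends are case-insensitive
        body = ncname[1:21] + ncname[21].upper() + "A=="
        data = bytearray(base64.urlsafe_b64decode(body))
        data[15] = (data[15] << 2) & 0xFF
    else:
        raise ValueError("not a UUID-NCName symbol")
    version = BOOKENDS.index(ncname[0].upper()) & 0xF
    ints = [int.from_bytes(data[i:i + 4], "big") for i in range(0, 16, 4)]
    variant = (ints[3] & 0xF0) << 24
    ints[3] >>= 8
    ints[3] |= (ints[2] & 0xFF) << 24
    ints[2] >>= 8
    ints[2] |= ((ints[1] & 0xF) << 24) | variant
    ints[1] = (ints[1] & 0xFFFF0000) | (version << 12) | ((ints[1] >> 4) & 0xFFF)
    return uuid.UUID(bytes=b"".join(i.to_bytes(4, "big") for i in ints))
```

   Applied to the UUID "4abc6330-f548-4e67-b9f9-12d4323769cd" from
   Section 3, "to_ncname_64" and "to_ncname_32" produce the two symbols
   given there, and "from_ncname" maps either symbol back to the same
   UUID.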