Network Working Group                                         C. Bormann
Internet-Draft                                   Universitaet Bremen TZI
Intended status: Standards Track                              P. Hoffman
Expires: September 3, 2018                                         ICANN
                                                          March 02, 2018

              Concise Binary Object Representation (CBOR)
                       draft-ietf-cbor-7049bis-02

Abstract

   The Concise Binary Object Representation (CBOR) is a data format
   whose design goals include the possibility of extremely small code
   size, fairly small message size, and extensibility without the need
   for version negotiation. These design goals make it different from
   earlier binary serializations such as ASN.1 and MessagePack.

Contributing

   This document is being worked on in the CBOR Working Group. Please
   contribute on the mailing list there, or in the GitHub repository for
   this draft: https://github.com/cbor-wg/CBORbis

   The charter for the CBOR Working Group says that the WG will update
   RFC 7049 to fix verified errata. Security issues and clarifications
   may be addressed, but changes to this document will ensure backward
   compatibility for popular deployed codebases. This document will be
   targeted at becoming an Internet Standard.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 3, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Objectives
     1.2.  Terminology
   2.  CBOR Data Models
     2.1.  Extended Generic Data Models
     2.2.  Specific Data Models
   3.  Specification of the CBOR Encoding
     3.1.  Major Types
     3.2.  Indefinite Lengths for Some Major Types
       3.2.1.  Indefinite-Length Arrays and Maps
       3.2.2.  Indefinite-Length Byte Strings and Text Strings
     3.3.  Floating-Point Numbers and Values with No Content
     3.4.  Optional Tagging of Items
       3.4.1.  Date and Time
       3.4.2.  Bignums
       3.4.3.  Decimal Fractions and Bigfloats
       3.4.4.  Content Hints
         3.4.4.1.  Encoded CBOR Data Item
         3.4.4.2.  Expected Later Encoding for CBOR-to-JSON Converters
         3.4.4.3.  Encoded Text
       3.4.5.  Self-Describe CBOR
     3.5.  CBOR Data Models
   4.  Creating CBOR-Based Protocols
     4.1.  CBOR in Streaming Applications
     4.2.  Generic Encoders and Decoders
     4.3.  Syntax Errors
       4.3.1.  Incomplete CBOR Data Items
       4.3.2.  Malformed Indefinite-Length Items
       4.3.3.  Unknown Additional Information Values
     4.4.  Other Decoding Errors
     4.5.  Handling Unknown Simple Values and Tags
     4.6.  Numbers
     4.7.  Specifying Keys for Maps
       4.7.1.  Equivalence of Keys
     4.8.  Undefined Values
     4.9.  Canonical CBOR
       4.9.1.  Length-first map key ordering
     4.10. Strict Mode
   5.  Converting Data between CBOR and JSON
     5.1.  Converting from CBOR to JSON
     5.2.  Converting from JSON to CBOR
   6.  Future Evolution of CBOR
     6.1.  Extension Points
     6.2.  Curating the Additional Information Space
   7.  Diagnostic Notation
     7.1.  Encoding Indicators
   8.  IANA Considerations
     8.1.  Simple Values Registry
     8.2.  Tags Registry
     8.3.  Media Type ("MIME Type")
     8.4.  CoAP Content-Format
     8.5.  The +cbor Structured Syntax Suffix Registration
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Examples
   Appendix B.  Jump Table
   Appendix C.  Pseudocode
   Appendix D.  Half-Precision
   Appendix E.  Comparison of Other Binary Formats to CBOR's Design
                Objectives
     E.1.  ASN.1 DER, BER, and PER
     E.2.  MessagePack
     E.3.  BSON
     E.4.  UBJSON
     E.5.  MSDTP: RFC 713
     E.6.  Conciseness on the Wire
   Appendix F.  Changes from RFC 7049
   Authors' Addresses

1.  Introduction

   There are hundreds of standardized formats for binary representation
   of structured data (also known as binary serialization formats). Of
   those, some are for specific domains of information, while others are
   generalized for arbitrary data. In the IETF, probably the best-known
   formats in the latter category are ASN.1's BER and DER [ASN.1].

   The format defined here follows some specific design goals that are
   not well met by current formats. The underlying data model is an
   extended version of the JSON data model [RFC7159]. It is important
   to note that this is not a proposal that the grammar in RFC 7159 be
   extended in general, since doing so would cause a significant
   backwards incompatibility with already deployed JSON documents.
   Instead, this document simply defines its own data model that starts
   from JSON.

   Appendix E lists some existing binary formats and discusses how well
   they do or do not fit the design objectives of the Concise Binary
   Object Representation (CBOR).

1.1.  Objectives

   The objectives of CBOR, roughly in decreasing order of importance,
   are:

   1.  The representation must be able to unambiguously encode most
       common data formats used in Internet standards.

       *  It must represent a reasonable set of basic data types and
          structures using binary encoding. "Reasonable" here is
          largely influenced by the capabilities of JSON, with the major
          addition of binary byte strings. The structures supported are
          limited to arrays and trees; loops and lattice-style graphs
          are not supported.

       *  There is no requirement that all data formats be uniquely
          encoded; that is, it is acceptable that the number "7" might
          be encoded in multiple different ways.

   2.  The code for an encoder or decoder must be able to be compact in
       order to support systems with very limited memory, processor
       power, and instruction sets.

       *  An encoder and a decoder need to be implementable in a very
          small amount of code (for example, in class 1 constrained
          nodes as defined in [RFC7228]).

       *  The format should use contemporary machine representations of
          data (for example, not requiring binary-to-decimal
          conversion).

   3.  Data must be able to be decoded without a schema description.

       *  Similar to JSON, encoded data should be self-describing so
          that a generic decoder can be written.

   4.  The serialization must be reasonably compact, but data
       compactness is secondary to code compactness for the encoder and
       decoder.

       *  "Reasonable" here is bounded by JSON as an upper bound in
          size, and by implementation complexity maintaining a lower
          bound. Using either general compression schemes or extensive
          bit-fiddling violates the complexity goals.

   5.  The format must be applicable to both constrained nodes and
       high-volume applications.

       *  This means it must be reasonably frugal in CPU usage for both
          encoding and decoding. This is relevant both for constrained
          nodes and for potential usage in applications with a very high
          volume of data.

   6.  The format must support all JSON data types for conversion to and
       from JSON.

       *  It must support a reasonable level of conversion as long as
          the data represented is within the capabilities of JSON. It
          must be possible to define a unidirectional mapping towards
          JSON for all types of data.

   7.  The format must be extensible, and the extended data must be
       decodable by earlier decoders.

       *  The format is designed for decades of use.

       *  The format must support a form of extensibility that allows
          fallback so that a decoder that does not understand an
          extension can still decode the message.

       *  The format must be able to be extended in the future by later
          IETF standards.

1.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119, BCP 14
   [RFC2119] and indicate requirement levels for compliant CBOR
   implementations.

   The term "byte" is used in its now-customary sense as a synonym for
   "octet".
   All multi-byte values are encoded in network byte order
   (that is, most significant byte first, also known as "big-endian").

   This specification makes use of the following terminology:

   Data item: A single piece of CBOR data. The structure of a data
      item may contain zero, one, or more nested data items. The term
      is used both for the data item in representation format and for
      the abstract idea that can be derived from that by a decoder.

   Decoder: A process that decodes a CBOR data item and makes it
      available to an application. Formally speaking, a decoder
      contains a parser to break up the input using the syntax rules of
      CBOR, as well as a semantic processor to prepare the data in a
      form suitable to the application.

   Encoder: A process that generates the representation format of a
      CBOR data item from application information.

   Data Stream: A sequence of zero or more data items, not further
      assembled into a larger containing data item. The independent
      data items that make up a data stream are sometimes also referred
      to as "top-level data items".

   Well-formed: A data item that follows the syntactic structure of
      CBOR. A well-formed data item uses the initial bytes and the byte
      strings and/or data items that are implied by their values as
      defined in CBOR and is not followed by extraneous data.

   Valid: A data item that is well-formed and also follows the semantic
      restrictions that apply to CBOR data items.

   Stream decoder: A process that decodes a data stream and makes each
      of the data items in the sequence available to an application as
      they are received.

   Where bit arithmetic or data types are explained, this document uses
   the notation familiar from the programming language C, except that
   "**" denotes exponentiation. Similar to the "0x" notation for
   hexadecimal numbers, numbers in binary notation are prefixed with
   "0b".
   Underscores can be added to such a number solely for
   readability, so 0b00100001 (0x21) might be written 0b001_00001 to
   emphasize the desired interpretation of the bits in the byte; in this
   case, it is split into three bits and five bits.

2.  CBOR Data Models

   CBOR is explicit about its generic data model, which defines the set
   of all data items that can be represented in CBOR. Its basic generic
   data model is extensible by the registration of simple type values
   and tags. Applications can then subset the resulting extended
   generic data model to build their specific data models.

   Within environments that can represent the data items in the generic
   data model, generic CBOR encoders and decoders can be implemented
   (which usually involves defining additional implementation data types
   for those data items that do not already have a natural
   representation in the environment). The ability to provide generic
   encoders and decoders is an explicit design goal of CBOR; however,
   many applications will provide their own application-specific
   encoders and/or decoders.

   In the basic (un-extended) generic data model, a data item is one of:

   o  an integer in the range -2**64..2**64-1 inclusive

   o  a simple value, identified by a number between 0 and 255, but
      distinct from that number

   o  a floating-point value, distinct from an integer, out of the set
      representable by IEEE 754 binary64 (including non-finites)

   o  a sequence of zero or more bytes ("byte string")

   o  a sequence of zero or more Unicode code points ("text string")

   o  a sequence of zero or more data items ("array")

   o  a mapping (mathematical function) from zero or more data items
      ("keys") each to a data item ("values") ("map")

   o  a tagged data item, comprising a tag (an integer in the range
      0..2**64-1) and a value (a data item)

   Note that integer and floating-point values are distinct in this
   model, even if they have the same numeric value.

2.1.  Extended Generic Data Models

   This basic generic data model comes pre-extended by the registration
   of a number of simple values and tags right in this document, such
   as:

   o  "false", "true", "null", and "undefined" (simple values identified
      by 20..23)

   o  integer and floating-point values with a larger range and
      precision than the above (tags 2 to 5)

   o  application data types such as a point in time (tags 1, 0)

   Further elements of the extended generic data model can be (and have
   been) defined via the IANA registries created for CBOR. Even if such
   an extension is unknown to a generic encoder or decoder, data items
   using that extension can be passed to or from the application by
   representing them at the interface to the application within the
   basic generic data model, i.e., as generic values of a simple type or
   generic tagged items.

   In other words, the basic generic data model is stable as defined in
   this document, while the extended generic data model expands by the
   registration of new simple values or tags, but never shrinks.

   While there is a strong expectation that generic encoders and
   decoders can represent "false", "true", and "null" in the form
   appropriate for their programming environment, implementation of the
   data model extensions created by tags is truly optional and a matter
   of implementation quality.

2.2.  Specific Data Models

   The specific data model for a CBOR-based protocol usually subsets the
   extended generic data model and assigns application semantics to the
   data items within this subset and its components. When documenting
   such specific data models, where it is desired to specify the types
   of data items, it is preferred to identify the types by their names
   in the generic data model ("negative integer", "array") instead of by
   referring to aspects of their CBOR representation ("major type 1",
   "major type 4").

   Specific data models can also specify that values of different types
   are equivalent for the purposes of map keys and encoder freedom. For
   example, in the generic data model, a valid map MAY have both "0" and
   "0.0" as keys, and an encoder MUST NOT encode "0.0" as an integer
   (major type 0, Section 3.1). However, if a specific data model
   declares that floating-point and integer representations of integral
   values are equivalent, map keys "0" and "0.0" would be considered
   duplicates and so invalid, and an encoder could encode
   integral-valued floats as integers or vice versa, perhaps to save
   encoded bytes.

3.  Specification of the CBOR Encoding

   A CBOR data item (Section 2) is encoded to or decoded from a byte
   string as described in this section. The encoding is summarized in
   Table 5.

   The initial byte of each data item contains both information about
   the major type (the high-order 3 bits, described in Section 3.1) and
   additional information (the low-order 5 bits). When the value of the
   additional information is less than 24, it is directly used as a
   small unsigned integer. When it is 24 to 27, the additional bytes
   for a variable-length integer immediately follow; the values 24 to 27
   of the additional information specify that its length is a 1-, 2-,
   4-, or 8-byte unsigned integer, respectively. Additional information
   value 31 is used for indefinite-length items, described in
   Section 3.2. Additional information values 28 to 30 are reserved for
   future expansion.

   In all additional information values, the resulting integer is
   interpreted depending on the major type. It may represent the actual
   data: for example, in integer types, the resulting integer is used
   for the value itself. It may instead supply length information: for
   example, in byte strings it gives the length of the byte string data
   that follows.

   A CBOR decoder implementation can be based on a jump table with all
   256 defined values for the initial byte (Table 5). A decoder in a
   constrained implementation can instead use the structure of the
   initial byte and following bytes for more compact code (see
   Appendix C for a rough impression of how this could look).

3.1.  Major Types

   The following lists the major types and the additional information
   and other bytes associated with the type.

   Major type 0: an unsigned integer. The 5-bit additional information
      is either the integer itself (for additional information values 0
      through 23) or the length of additional data. Additional
      information 24 means the value is represented in an additional
      uint8_t, 25 means a uint16_t, 26 means a uint32_t, and 27 means a
      uint64_t.
      For example, the integer 10 is denoted as the one byte
      0b000_01010 (major type 0, additional information 10). The
      integer 500 would be 0b000_11001 (major type 0, additional
      information 25) followed by the two bytes 0x01f4, which is 500 in
      decimal.

   Major type 1: a negative integer. The encoding follows the rules
      for unsigned integers (major type 0), except that the value is
      then -1 minus the encoded unsigned integer. For example, the
      integer -500 would be 0b001_11001 (major type 1, additional
      information 25) followed by the two bytes 0x01f3, which is 499 in
      decimal.

   Major type 2: a byte string. The string's length in bytes is
      represented following the rules for unsigned integers (major type
      0). For example, a byte string whose length is 5 would have an
      initial byte of 0b010_00101 (major type 2, additional information
      5 for the length), followed by 5 bytes of binary content. A byte
      string whose length is 500 would have 3 initial bytes:
      0b010_11001 (major type 2, additional information 25 to indicate a
      two-byte length) followed by the two bytes 0x01f4 for a length of
      500. These are followed by 500 bytes of binary content.

   Major type 3: a text string, specifically a string of Unicode
      characters that is encoded as UTF-8 [RFC3629]. The format of this
      type is identical to that of byte strings (major type 2), that is,
      as with major type 2, the length gives the number of bytes. This
      type is provided for systems that need to interpret or display
      human-readable text, and allows the differentiation between
      unstructured bytes and text that has a specified repertoire and
      encoding. In contrast to formats such as JSON, the Unicode
      characters in this type are never escaped.
      Thus, a newline
      character (U+000A) is always represented in a string as the byte
      0x0a, and never as the bytes 0x5c6e (the characters "\" and "n")
      or as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and
      "a").

   Major type 4: an array of data items. Arrays are also called lists,
      sequences, or tuples. The array's length follows the rules for
      byte strings (major type 2), except that the length denotes the
      number of data items, not the length in bytes that the array takes
      up. Items in an array do not need to all be of the same type.
      For example, an array that contains 10 items of any type would
      have an initial byte of 0b100_01010 (major type of 4, additional
      information of 10 for the length) followed by the 10 remaining
      items.

   Major type 5: a map of pairs of data items. Maps are also called
      tables, dictionaries, hashes, or objects (in JSON). A map is
      made up of pairs of data items, each pair consisting of a key
      that is immediately followed by a value. The map's length follows
      the rules for byte strings (major type 2), except that the length
      denotes the number of pairs, not the length in bytes that the map
      takes up. For example, a map that contains 9 pairs would have an
      initial byte of 0b101_01001 (major type of 5, additional
      information of 9 for the number of pairs) followed by the 18
      remaining items. The first item is the first key, the second item
      is the first value, the third item is the second key, and so on.
      A map that has duplicate keys may be well-formed, but it is not
      valid, and thus it causes indeterminate decoding; see also
      Section 4.7.

   Major type 6: optional semantic tagging of other major types. See
      Section 3.4.

   Major type 7: floating-point numbers and simple data types that need
      no content, as well as the "break" stop code. See Section 3.3.
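
   As a rough illustration of the rules above for major types 0 to 3,
   the initial byte and any following length or value bytes can be
   produced as sketched below in Python. The function names
   (encode_head, encode_int, encode_bytes, encode_text) and the term
   "head" are assumptions made for this sketch and do not appear in this
   document. Always choosing the shortest form is also a choice of the
   sketch, not a requirement stated in this section.

```python
import struct


def encode_head(major_type: int, argument: int) -> bytes:
    """Build the initial byte plus any following big-endian argument bytes.

    Hypothetical helper: 'argument' is the value (major types 0 and 1)
    or the length/count (major types 2 to 5).
    """
    if not 0 <= major_type <= 7:
        raise ValueError("major type must be in 0..7")
    if not 0 <= argument < 2**64:
        raise ValueError("argument must fit in 64 bits")
    high = major_type << 5  # major type lives in the high-order 3 bits
    if argument < 24:
        # Values below 24 are carried directly in the low-order 5 bits.
        return bytes([high | argument])
    if argument < 2**8:
        return bytes([high | 24]) + struct.pack(">B", argument)
    if argument < 2**16:
        return bytes([high | 25]) + struct.pack(">H", argument)
    if argument < 2**32:
        return bytes([high | 26]) + struct.pack(">I", argument)
    return bytes([high | 27]) + struct.pack(">Q", argument)


def encode_int(value: int) -> bytes:
    """Major type 0 for non-negative values, major type 1 otherwise."""
    if value >= 0:
        return encode_head(0, value)
    # A negative integer n is stored as the unsigned integer -1 - n.
    return encode_head(1, -1 - value)


def encode_bytes(data: bytes) -> bytes:
    """Major type 2: length in bytes, then the bytes themselves."""
    return encode_head(2, len(data)) + data


def encode_text(text: str) -> bytes:
    """Major type 3: length of the UTF-8 encoding, then the UTF-8 bytes."""
    utf8 = text.encode("utf-8")
    return encode_head(3, len(utf8)) + utf8
```

   With these helpers, encode_int(500) yields the bytes 0x1901f4 and
   encode_int(-500) yields 0x3901f3, matching the examples given for
   major types 0 and 1.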

   These eight major types lead to a simple table showing which of the
   256 possible values for the initial byte of a data item are used
   (Table 5).

   In major types 6 and 7, many of the possible values are reserved for
   future specification. See Section 8 for more information on these
   values.

3.2.  Indefinite Lengths for Some Major Types

   Four CBOR items (arrays, maps, byte strings, and text strings) can be
   encoded with an indefinite length using additional information value
   31. This is useful if the encoding of the item needs to begin before
   the number of items inside the array or map, or the total length of
   the string, is known. (The application of this is often referred to
   as "streaming" within a data item.)

   Indefinite-length arrays and maps are dealt with differently than
   indefinite-length byte strings and text strings.

3.2.1.  Indefinite-Length Arrays and Maps

   Indefinite-length arrays and maps are simply opened without
   indicating the number of data items that will be included in the
   array or map, using the additional information value of 31. The
   initial major type and additional information byte is followed by the
   elements of the array or map, just as they would be in other arrays
   or maps. The end of the array or map is indicated by encoding a
   "break" stop code in a place where the next data item would normally
   have been included. The "break" is encoded with major type 7 and
   additional information value 31 (0b111_11111) but is not itself a
   data item: it is just a syntactic feature to close the array or map.
   That is, the "break" stop code comes after the last item in the array
   or map, and it cannot occur anywhere else in place of a data item.

   In this way, indefinite-length arrays and maps look identical to
   other arrays and maps except for beginning with the additional
   information value 31 and ending with the "break" stop code.
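
   The following minimal Python sketch shows one way a decoder might
   treat the "break" stop code when reading arrays. It handles only
   unsigned integers (major type 0) and arrays (major type 4); the names
   decode_item, read_argument, and BREAK are illustrative assumptions and
   not part of CBOR.

```python
BREAK = object()  # sentinel: the "break" stop code is not a data item


def read_argument(data: bytes, pos: int, info: int):
    """Return (integer, new position) for additional information 0..27."""
    if info < 24:
        return info, pos
    if info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4, or 8 following bytes
        if pos + size > len(data):
            raise ValueError("truncated input")
        return int.from_bytes(data[pos:pos + size], "big"), pos + size
    raise ValueError("additional information value not usable here")


def decode_item(data: bytes, pos: int = 0, allow_break: bool = False):
    """Decode one item starting at pos; return (value, new position)."""
    if pos >= len(data):
        raise ValueError("truncated input")
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if initial == 0xFF:
        # "break" is only legal where an indefinite-length item expects
        # its next element.
        if not allow_break:
            raise ValueError('"break" outside an indefinite-length item')
        return BREAK, pos
    if major == 0:
        return read_argument(data, pos, info)
    if major == 4:
        items = []
        if info == 31:
            # Indefinite length: read items until the "break" stop code.
            while True:
                item, pos = decode_item(data, pos, allow_break=True)
                if item is BREAK:
                    return items, pos
                items.append(item)
        count, pos = read_argument(data, pos, info)
        for _ in range(count):
            item, pos = decode_item(data, pos)
            items.append(item)
        return items, pos
    raise ValueError("major type not handled in this sketch")
```

   Because each nested call consumes exactly one "break", nested
   indefinite-length arrays close in the right order without any extra
   bookkeeping. A "break" that appears inside a definite-length array is
   rejected, since allow_break is only set while reading the elements of
   an indefinite-length array.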

   Arrays and maps with indefinite lengths allow any number of items
   (for arrays) and key/value pairs (for maps) to be given before the
   "break" stop code. There is no restriction against nesting
   indefinite-length array or map items. A "break" only terminates a
   single item, so nested indefinite-length items need exactly as many
   "break" stop codes as there are type bytes starting an
   indefinite-length item.

   For example, assume an encoder wants to represent the abstract array
   [1, [2, 3], [4, 5]]. The definite-length encoding would be
   0x8301820203820405:

   83        -- Array of length 3
      01     -- 1
      82     -- Array of length 2
         02  -- 2
         03  -- 3
      82     -- Array of length 2
         04  -- 4
         05  -- 5

   Indefinite-length encoding could be applied independently to each of
   the three arrays encoded in this data item, as required, leading to
   representations such as:

   0x9f018202039f0405ffff
   9F        -- Start indefinite-length array
      01     -- 1
      82     -- Array of length 2
         02  -- 2
         03  -- 3
      9F     -- Start indefinite-length array
         04  -- 4
         05  -- 5
         FF  -- "break" (inner array)
      FF     -- "break" (outer array)

   0x9f01820203820405ff
   9F        -- Start indefinite-length array
      01     -- 1
      82     -- Array of length 2
         02  -- 2
         03  -- 3
      82     -- Array of length 2
         04  -- 4
         05  -- 5
      FF     -- "break"

   0x83018202039f0405ff
   83        -- Array of length 3
      01     -- 1
      82     -- Array of length 2
         02  -- 2
         03  -- 3
      9F     -- Start indefinite-length array
         04  -- 4
         05  -- 5
         FF  -- "break"

   0x83019f0203ff820405
   83        -- Array of length 3
      01     -- 1
      9F     -- Start indefinite-length array
         02  -- 2
         03  -- 3
         FF  -- "break"
      82     -- Array of length 2
         04  -- 4
         05  -- 5

   An example of an indefinite-length map (that happens to have two
   key/value pairs) might be:

   0xbf6346756ef563416d7421ff
   BF           -- Start indefinite-length map
      63        -- First key, UTF-8 string length 3
         46756e -- "Fun"
      F5        -- First value, true
      63        -- Second key, UTF-8 string length 3
         416d74 -- "Amt"
      21        -- Second value, -2
      FF        -- "break"

3.2.2.  Indefinite-Length Byte Strings and Text Strings

   Indefinite-length byte strings and text strings are actually a
   concatenation of zero or more definite-length byte or text strings
   ("chunks") that are together treated as one contiguous string.
   Indefinite-length strings are opened with the major type and
   additional information value of 31, but what follows is a series of
   byte or text strings that have definite lengths (the chunks). The
   end of the series of chunks is indicated by encoding the "break" stop
   code (0b111_11111) in a place where the next chunk in the series
   would occur. The contents of the chunks are concatenated together,
   and the overall length of the indefinite-length string will be the
   sum of the lengths of all of the chunks. In summary, an
   indefinite-length string is encoded similarly to how an
   indefinite-length array of its chunks would be encoded, except that
   the major type of the indefinite-length string is that of a (text or
   byte) string and matches the major types of its chunks.

   For indefinite-length byte strings, every data item (chunk) between
   the indefinite-length indicator and the "break" MUST be a
   definite-length byte string item; if the parser sees any item type
   other than a byte string before it sees the "break", it is an error.

   For example, assume the sequence:

   0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111

   5F              -- Start indefinite-length byte string
      44           -- Byte string of length 4
         aabbccdd  -- Bytes content
      43           -- Byte string of length 3
         eeff99    -- Bytes content
      FF           -- "break"

   After decoding, this results in a single byte string with seven
   bytes: 0xaabbccddeeff99.
650 Text strings with indefinite lengths act the same as byte strings 651 with indefinite lengths, except that all their chunks MUST be 652 definite-length text strings. Note that this implies that the bytes 653 of a single UTF-8 character cannot be spread between chunks: a new 654 chunk can only be started at a character boundary.

656 3.3. Floating-Point Numbers and Values with No Content

658 Major type 7 is for two types of data: floating-point numbers and 659 "simple values" that do not need any content. Each value of the 660 5-bit additional information in the initial byte has its own separate 661 meaning, as defined in Table 1. Like the major types for integers, 662 items of this major type do not carry content data; all the 663 information is in the initial bytes.

665 +-------------+--------------------------------------------------+
666 | 5-Bit Value | Semantics                                        |
667 +-------------+--------------------------------------------------+
668 | 0..23       | Simple value (value 0..23)                       |
669 |             |                                                  |
670 | 24          | Simple value (value 32..255 in following byte)   |
671 |             |                                                  |
672 | 25          | IEEE 754 Half-Precision Float (16 bits follow)   |
673 |             |                                                  |
674 | 26          | IEEE 754 Single-Precision Float (32 bits follow) |
675 |             |                                                  |
676 | 27          | IEEE 754 Double-Precision Float (64 bits follow) |
677 |             |                                                  |
678 | 28..30      | (Unassigned)                                     |
679 |             |                                                  |
680 | 31          | "break" stop code for indefinite-length items    |
681 +-------------+--------------------------------------------------+

683 Table 1: Values for Additional Information in Major Type 7

685 As with all other major types, the 5-bit value 24 signifies a single- 686 byte extension: it is followed by an additional byte to represent the 687 simple value. (To minimize confusion, only the values 32 to 255 are 688 used.) This maintains the structure of the initial bytes: as for the 689 other major types, the length of these always depends on the 690 additional information in the first byte. Table 2 lists the values 691 assigned and available for simple types.
693 +---------+-----------------+
694 | Value   | Semantics       |
695 +---------+-----------------+
696 | 0..19   | (Unassigned)    |
697 |         |                 |
698 | 20      | False           |
699 |         |                 |
700 | 21      | True            |
701 |         |                 |
702 | 22      | Null            |
703 |         |                 |
704 | 23      | Undefined value |
705 |         |                 |
706 | 24..31  | (Reserved)      |
707 |         |                 |
708 | 32..255 | (Unassigned)    |
709 +---------+-----------------+

711 Table 2: Simple Values

713 The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit 714 IEEE 754 binary floating-point values. These floating-point values 715 are encoded in the additional bytes of the appropriate size. (See 716 Appendix D for some information about 16-bit floating point.)

718 An encoder MUST NOT encode False as the two-byte sequence of 0xf814, 719 MUST NOT encode True as the two-byte sequence of 0xf815, MUST NOT 720 encode Null as the two-byte sequence of 0xf816, and MUST NOT encode 721 Undefined value as the two-byte sequence of 0xf817. A decoder MUST 722 treat these two-byte sequences as an error. Similar prohibitions 723 apply to the unassigned simple values as well.

725 3.4. Optional Tagging of Items

727 In CBOR, a data item can optionally be preceded by a tag to give it 728 additional semantics while retaining its structure. The tag is major 729 type 6, and represents an integer number as indicated by the tag's 730 integer value; the (sole) data item is carried as content data. If a 731 tag requires structured data, this structure is encoded into the 732 nested data item. The definition of a tag usually restricts what 733 kinds of nested data item or items can be carried by a tag.

735 The initial bytes of the tag follow the rules for positive integers 736 (major type 0). The tag is followed by a single data item of any 737 type. For example, assume that a byte string of length 12 is marked 738 with a tag to indicate it is a positive bignum (Section 3.4.2).
This 739 would be marked as 0b110_00010 (major type 6, additional information 740 2 for the tag) followed by 0b010_01100 (major type 2, additional 741 information of 12 for the length) followed by the 12 bytes of the 742 bignum. 744 Decoders do not need to understand tags, and thus tags may be of 745 little value in applications where the implementation creating a 746 particular CBOR data item and the implementation decoding that stream 747 know the semantic meaning of each item in the data flow. Their 748 primary purpose in this specification is to define common data types 749 such as dates. A secondary purpose is to allow optional tagging when 750 the decoder is a generic CBOR decoder that might be able to benefit 751 from hints about the content of items. Understanding the semantic 752 tags is optional for a decoder; it can just jump over the initial 753 bytes of the tag and interpret the tagged data item itself. 755 A tag always applies to the item that directly follows it. 756 Thus, if tag A is followed by tag B, which is followed by data item 757 C, tag A applies to the result of applying tag B to data item C. 758 That is, a tagged item is a data item consisting of a tag and a 759 value. The content of the tagged item is the data item (the value) 760 that is being tagged. 762 IANA maintains a registry of tag values as described in Section 8.2. 763 Table 3 provides a list of initial values, with definitions in the 764 rest of this section.
766 +-----------+--------------+----------------------------------------+
767 | Tag       | Data Item    | Semantics                              |
768 +-----------+--------------+----------------------------------------+
769 | 0         | UTF-8 string | Standard date/time string; see         |
770 |           |              | Section 3.4.1                          |
771 |           |              |                                        |
772 | 1         | multiple     | Epoch-based date/time; see             |
773 |           |              | Section 3.4.1                          |
774 |           |              |                                        |
775 | 2         | byte string  | Positive bignum; see Section 3.4.2     |
776 |           |              |                                        |
777 | 3         | byte string  | Negative bignum; see Section 3.4.2     |
778 |           |              |                                        |
779 | 4         | array        | Decimal fraction; see Section 3.4.3    |
780 |           |              |                                        |
781 | 5         | array        | Bigfloat; see Section 3.4.3            |
782 |           |              |                                        |
783 | 6..20     | (Unassigned) | (Unassigned)                           |
784 |           |              |                                        |
785 | 21        | multiple     | Expected conversion to base64url       |
786 |           |              | encoding; see Section 3.4.4.2          |
787 |           |              |                                        |
788 | 22        | multiple     | Expected conversion to base64          |
789 |           |              | encoding; see Section 3.4.4.2          |
790 |           |              |                                        |
791 | 23        | multiple     | Expected conversion to base16          |
792 |           |              | encoding; see Section 3.4.4.2          |
793 |           |              |                                        |
794 | 24        | byte string  | Encoded CBOR data item; see            |
795 |           |              | Section 3.4.4.1                        |
796 |           |              |                                        |
797 | 25..31    | (Unassigned) | (Unassigned)                           |
798 |           |              |                                        |
799 | 32        | UTF-8 string | URI; see Section 3.4.4.3               |
800 |           |              |                                        |
801 | 33        | UTF-8 string | base64url; see Section 3.4.4.3         |
802 |           |              |                                        |
803 | 34        | UTF-8 string | base64; see Section 3.4.4.3            |
804 |           |              |                                        |
805 | 35        | UTF-8 string | Regular expression; see                |
806 |           |              | Section 3.4.4.3                        |
807 |           |              |                                        |
808 | 36        | UTF-8 string | MIME message; see Section 3.4.4.3      |
809 |           |              |                                        |
810 | 37..55798 | (Unassigned) | (Unassigned)                           |
811 |           |              |                                        |
812 | 55799     | multiple     | Self-describe CBOR; see Section 3.4.5  |
813 |           |              |                                        |
814 | 55800+    | (Unassigned) | (Unassigned)                           |
815 +-----------+--------------+----------------------------------------+

817 Table 3: Values for Tags

819 3.4.1. Date and Time

821 Protocols using tag values 0 and 1 extend the generic data model 822 (Section 2) with data items representing points in time.
824 Tag value 0 is for date/time strings that follow the standard format 825 described in [RFC3339], as refined by Section 3.3 of [RFC4287]. 827 Tag value 1 is for numerical representation of seconds relative to 828 1970-01-01T00:00Z in UTC. (For the non-negative values that the 829 Portable Operating System Interface (POSIX) defines, the number of 830 seconds is counted in the same way as for POSIX "seconds since the 831 epoch" [TIME_T].) The tagged item can be a positive or negative 832 integer (major types 0 and 1), or a floating-point number (major type 833 7 with additional information 25, 26, or 27). Note that the number 834 can be negative (time before 1970-01-01T00:00Z) and, if a floating- 835 point number, indicate fractional seconds. 837 3.4.2. Bignums 839 Protocols using tag values 2 and 3 extend the generic data model 840 (Section 2) with "bignums" representing arbitrary integers. In the 841 generic data model, bignum values are not equal to integers from the 842 basic data model, but specific data models can define that 843 equivalence. 845 Bignums are encoded as a byte string data item, which is interpreted 846 as an unsigned integer n in network byte order. For tag value 2, the 847 value of the bignum is n. For tag value 3, the value of the bignum 848 is -1 - n. Decoders that understand these tags MUST be able to 849 decode bignums that have leading zeroes. 851 For example, the number 18446744073709551616 (2**64) is represented 852 as 0b110_00010 (major type 6, tag 2), followed by 0b010_01001 (major 853 type 2, length 9), followed by 0x010000000000000000 (one byte 0x01 854 and eight bytes 0x00). In hexadecimal: 856 C2 -- Tag 2 857 49 -- Byte string of length 9 858 010000000000000000 -- Bytes content 860 3.4.3. Decimal Fractions and Bigfloats 862 Protocols using tag value 4 extend the generic data model with data 863 items representing arbitrary-length decimal fractions m*(10**e).
864 Protocols using tag value 5 extend the generic data model with data 865 items representing arbitrary-length binary fractions m*(2**e). As 866 with bignums, values of different types are not equal in the generic 867 data model. 869 Decimal fractions combine an integer mantissa with a base-10 scaling 870 factor. They are most useful if an application needs the exact 871 representation of a decimal fraction such as 1.1 because there is no 872 exact representation for many decimal fractions in binary floating 873 point. 875 Bigfloats combine an integer mantissa with a base-2 scaling factor. 876 They are binary floating-point values that can exceed the range or 877 the precision of the three IEEE 754 formats supported by CBOR 878 (Section 3.3). Bigfloats may also be used by constrained 879 applications that need some basic binary floating-point capability 880 without the need for supporting IEEE 754. 882 A decimal fraction or a bigfloat is represented as a tagged array 883 that contains exactly two integer numbers: an exponent e and a 884 mantissa m. Decimal fractions (tag 4) use base-10 exponents; the 885 value of a decimal fraction data item is m*(10**e). Bigfloats (tag 886 5) use base-2 exponents; the value of a bigfloat data item is 887 m*(2**e). The exponent e MUST be represented in an integer of major 888 type 0 or 1, while the mantissa can also be a bignum (Section 3.4.2).
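The value rules for tags 2 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names bignum_value and scaled_value are invented for this example, the tag content is assumed to be already decoded into Python bytes and int objects, and exact rational arithmetic (fractions.Fraction) is one possible choice of native representation, not one mandated by this document.

```python
from fractions import Fraction


def bignum_value(tag: int, content: bytes) -> int:
    """Value of a tag 2 or tag 3 bignum (hypothetical helper)."""
    # The byte string is an unsigned integer n in network byte order;
    # leading zero bytes do not change the result.
    n = int.from_bytes(content, "big")
    if tag == 2:
        return n
    if tag == 3:
        return -1 - n
    raise ValueError("not a bignum tag")


def scaled_value(tag: int, exponent: int, mantissa: int) -> Fraction:
    """Value of a tag 4 decimal fraction or tag 5 bigfloat (hypothetical helper)."""
    if tag == 4:
        base = 10  # decimal fraction: m*(10**e)
    elif tag == 5:
        base = 2   # bigfloat: m*(2**e)
    else:
        raise ValueError("not a decimal fraction or bigfloat tag")
    # Fraction keeps the result exact, including for negative exponents.
    return mantissa * Fraction(base) ** exponent


two_to_64 = bignum_value(2, bytes.fromhex("010000000000000000"))
temperature = scaled_value(4, -2, 27315)   # 27315 * 10**-2
one_and_half = scaled_value(5, -1, 3)      # 3 * 2**-1
```

The order of the arguments to scaled_value follows the order of the array elements, exponent first and mantissa second.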
890 An example of a decimal fraction is that the number 273.15 could be 891 represented as 0b110_00100 (major type of 6 for the tag, additional 892 information of 4 for the type of tag), followed by 0b100_00010 (major 893 type of 4 for the array, additional information of 2 for the length 894 of the array), followed by 0b001_00001 (major type of 1 for the first 895 integer, additional information of 1 for the value of -2), followed 896 by 0b000_11001 (major type of 0 for the second integer, additional 897 information of 25 for a two-byte value), followed by 898 0b0110101010110011 (27315 in two bytes). In hexadecimal: 900 C4 -- Tag 4 901 82 -- Array of length 2 902 21 -- -2 903 19 6ab3 -- 27315 905 An example of a bigfloat is that the number 1.5 could be represented 906 as 0b110_00101 (major type of 6 for the tag, additional information 907 of 5 for the type of tag), followed by 0b100_00010 (major type of 4 908 for the array, additional information of 2 for the length of the 909 array), followed by 0b001_00000 (major type of 1 for the first 910 integer, additional information of 0 for the value of -1), followed 911 by 0b000_00011 (major type of 0 for the second integer, additional 912 information of 3 for the value of 3). In hexadecimal: 914 C5 -- Tag 5 915 82 -- Array of length 2 916 20 -- -1 917 03 -- 3 919 Decimal fractions and bigfloats provide no representation of 920 Infinity, -Infinity, or NaN; if these are needed in place of a 921 decimal fraction or bigfloat, the IEEE 754 half-precision 922 representations from Section 3.3 can be used. For constrained 923 applications, where there is a choice between representing a specific 924 number as an integer and as a decimal fraction or bigfloat (such as 925 when the exponent is small and non-negative), there is a quality-of- 926 implementation expectation that the integer representation is used 927 directly. 929 3.4.4. 
Content Hints 931 The tags in this section are for content hints that might be used by 932 generic CBOR processors. These content hints do not extend the 933 generic data model. 935 3.4.4.1. Encoded CBOR Data Item 937 Sometimes it is beneficial to carry an embedded CBOR data item that 938 is not meant to be decoded immediately at the time the enclosing data 939 item is being parsed. Tag 24 (CBOR data item) can be used to tag the 940 embedded byte string as a data item encoded in CBOR format. 942 3.4.4.2. Expected Later Encoding for CBOR-to-JSON Converters 944 Tags 21 to 23 indicate that a byte string might require a specific 945 encoding when interoperating with a text-based representation. These 946 tags are useful when an encoder knows that the byte string data it is 947 writing is likely to be later converted to a particular JSON-based 948 usage. That usage specifies that some strings are encoded as base64, 949 base64url, and so on. The encoder uses byte strings instead of doing 950 the encoding itself to reduce the message size, to reduce the code 951 size of the encoder, or both. The encoder does not know whether or 952 not the converter will be generic, and therefore wants to say what it 953 believes is the proper way to convert binary strings to JSON. 955 The data item tagged can be a byte string or any other data item. In 956 the latter case, the tag applies to all of the byte string data items 957 contained in the data item, except for those contained in a nested 958 data item tagged with an expected conversion. 960 These three tag types suggest conversions to three of the base data 961 encodings defined in [RFC4648]. For base64url encoding, padding is 962 not used (see Section 3.2 of RFC 4648); that is, all trailing equals 963 signs ("=") are removed from the base64url-encoded string. Later 964 tags might be defined for other data encodings of RFC 4648 or for 965 other ways to encode binary data in strings. 967 3.4.4.3. 
Encoded Text 969 Some text strings hold data that have formats widely used on the 970 Internet, and sometimes those formats can be validated and presented 971 to the application in appropriate form by the decoder. There are 972 tags for some of these formats. 974 o Tag 32 is for URIs, as defined in [RFC3986]; 975 o Tags 33 and 34 are for base64url- and base64-encoded text strings, 976 as defined in [RFC4648]; 978 o Tag 35 is for regular expressions in Perl Compatible Regular 979 Expressions (PCRE) / JavaScript syntax [ECMA262]; 981 o Tag 36 is for MIME messages (including all headers), as defined in 982 [RFC2045]. 984 Note that tags 33 and 34 differ from 21 and 22 in that the data is 985 transported in base-encoded form for the former and in raw byte 986 string form for the latter. 988 3.4.5. Self-Describe CBOR 990 In many applications, it will be clear from the context that CBOR is 991 being employed for encoding a data item. For instance, a specific 992 protocol might specify the use of CBOR, or a media type is indicated 993 that specifies its use. However, there may be applications where 994 such context information is not available, such as when CBOR data is 995 stored in a file and disambiguating metadata is not in use. Here, it 996 may help to have some distinguishing characteristics for the data 997 itself. 999 Tag 55799 is defined for this purpose. It does not impart any 1000 special semantics on the data item that follows; that is, the 1001 semantics of a data item tagged with tag 55799 is exactly identical 1002 to the semantics of the data item itself. 1004 The serialization of this tag is 0xd9d9f7, which appears not to be in 1005 use as a distinguishing mark for frequently used file types. In 1006 particular, it is not a valid start of a Unicode text in any Unicode 1007 encoding if followed by a valid CBOR data item. 1009 For instance, a decoder might be able to parse both CBOR and JSON.
1010 Such a decoder would need to mechanically distinguish the two 1011 formats. An easy way for an encoder to help the decoder would be to 1012 tag the entire CBOR item with tag 55799, the serialization of which 1013 will never be found at the beginning of a JSON text. 1015 3.5. CBOR Data Models 1017 CBOR is explicit about its generic data model, which defines the set 1018 of all data items that can be represented in CBOR. Its basic generic 1019 data model is extensible by the registration of simple type values 1020 and tags. Applications can then subset the resulting extended 1021 generic data model to build their specific data models. 1023 Within environments that can represent the data items in the generic 1024 data model, generic CBOR encoders and decoders can be implemented 1025 (which usually involves defining additional implementation data types 1026 for those data items that do not already have a natural 1027 representation in the environment). The ability to provide generic 1028 encoders and decoders is an explicit design goal of CBOR; however, 1029 many applications will provide their own application-specific 1030 encoders and/or decoders.
1032 In the basic (un-extended) generic data model, a data item is one of: 1034 o an integer in the range -2**64..2**64-1 inclusive 1036 o a simple value, identified by a number between 0 and 255, but 1037 distinct from that number 1039 o a floating point value, distinct from an integer, out of the set 1040 representable by IEEE 754 binary64 (including non-finites) 1042 o a sequence of zero or more bytes ("byte string") 1044 o a sequence of zero or more Unicode code points ("text string") 1046 o a sequence of zero or more data items ("array") 1048 o a mapping (mathematical function) from zero or more data items 1049 ("keys") each to a data item ("values"), ("map") 1051 o a tagged data item, comprising a tag (an integer in the range 1052 0..2**64-1) and a value (a data item) 1054 Note that integer and floating-point values are distinct in this 1055 model, even if they have the same numeric value. 1057 This basic generic data model comes pre-extended by the registration 1058 of a number of simple values and tags right in this document, such 1059 as: 1061 o "false", "true", "null", and "undefined" (simple values identified 1062 by 20..23) 1064 o integer and floating point values with a larger range and 1065 precision than the above (tags 2 to 5) 1067 o application data types such as a point in time or an RFC 3339 1068 date/time string (tags 1, 0) 1070 Further elements of the extended generic data model can be (and have 1071 been) defined via the IANA registries created for CBOR. Even if such 1072 an extension is unknown to a generic encoder or decoder, data items 1073 using that extension can be passed to or from the application by 1074 representing them at the interface to the application within the 1075 basic generic data model, i.e., as generic values of a simple type or 1076 generic tagged items. 
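One way a generic decoder might represent simple values and tagged items that it does not know is sketched below in Python. The class names SimpleValue and Tagged, and the tag number 1234567, are assumptions made for this example only; the remaining kinds of data item are assumed to map onto the environment's own integers, floats, byte strings, text strings, lists, and dictionaries.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SimpleValue:
    """A simple value: identified by a number, but distinct from that number."""
    number: int

    def __post_init__(self) -> None:
        if not 0 <= self.number <= 255:
            raise ValueError("simple value number must be in 0..255")


@dataclass(frozen=True)
class Tagged:
    """A generic tagged item: a tag number plus the tagged data item."""
    tag: int
    value: Any

    def __post_init__(self) -> None:
        if not 0 <= self.tag < 2**64:
            raise ValueError("tag must be in 0..2**64-1")


# A tag the decoder does not understand is still handed to the
# application, unchanged, as a generic tagged item.
# (1234567 is a hypothetical tag number chosen for illustration.)
unknown = Tagged(1234567, b"\x01\x02")
simple_twenty = SimpleValue(20)
```

Wrapping the number in its own type keeps a simple value from being confused with the integer of the same number: in this sketch SimpleValue(20) does not compare equal to 20.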
1078 In other words, the basic generic data model is stable as defined in 1079 this document, while the extended generic data model expands by the 1080 registration of new simple values or tags, but never shrinks. 1082 While there is a strong expectation that generic encoders and 1083 decoders can represent "false", "true", and "null" ("undefined" is 1084 intentionally omitted) in the form appropriate for their programming 1085 environment, implementation of the data model extensions created by 1086 tags is truly optional and a matter of implementation quality. 1088 A specific data model usually subsets the extended generic data model 1089 and assigns application semantics to the data items within this 1090 subset and its components. When documenting such specific data 1091 models, where it is desired to specify the types of data items, it is 1092 preferred to identify the types by their names in the generic data 1093 model ("negative integer", "array") instead of by referring to 1094 aspects of their CBOR representation ("major type 1", "major type 1095 4"). 1097 4. Creating CBOR-Based Protocols 1099 Data formats such as CBOR are often used in environments where there 1100 is no format negotiation. A specific design goal of CBOR is to not 1101 need any included or assumed schema: a decoder can take a CBOR item 1102 and decode it with no other knowledge. 1104 Of course, in real-world implementations, the encoder and the decoder 1105 will have a shared view of what should be in a CBOR data item. For 1106 example, an agreed-to format might be "the item is an array whose 1107 first value is a UTF-8 string, second value is an integer, and 1108 subsequent values are zero or more floating-point numbers" or "the 1109 item is a map that has byte strings for keys and contains at least 1110 one pair whose key is 0xab01". 1112 This specification puts no restrictions on CBOR-based protocols. 
An 1113 encoder can be capable of encoding as many or as few types of values 1114 as is required by the protocol in which it is used; a decoder can be 1115 capable of understanding as many or as few types of values as is 1116 required by the protocols in which it is used. This lack of 1117 restrictions allows CBOR to be used in extremely constrained 1118 environments. 1120 This section discusses some considerations in creating CBOR-based 1121 protocols. It is advisory only and explicitly excludes any language 1122 from RFC 2119 other than words that could be interpreted as "MAY" in 1123 the sense of RFC 2119. 1125 4.1. CBOR in Streaming Applications 1127 In a streaming application, a data stream may be composed of a 1128 sequence of CBOR data items concatenated back-to-back. In such an 1129 environment, the decoder immediately begins decoding a new data item 1130 if data is found after the end of a previous data item. 1132 Not all of the bytes making up a data item may be immediately 1133 available to the decoder; some decoders will buffer additional data 1134 until a complete data item can be presented to the application. 1135 Other decoders can present partial information about a top-level data 1136 item to an application, such as the nested data items that could 1137 already be decoded, or even parts of a byte string that hasn't 1138 completely arrived yet. 1140 Note that some applications and protocols will not want to use 1141 indefinite-length encoding. Using indefinite-length encoding allows 1142 an encoder to not need to marshal all the data for counting, but it 1143 requires a decoder to allocate increasing amounts of memory while 1144 waiting for the end of the item. This might be fine for some 1145 applications but not others. 1147 4.2. Generic Encoders and Decoders 1149 A generic CBOR decoder can decode all well-formed CBOR data and 1150 present them to an application. 
CBOR data is well-formed if it uses 1151 the initial bytes, as well as the byte strings and/or data items that 1152 are implied by their values, in the manner defined by CBOR, and no 1153 extraneous data follows (Appendix C). 1155 Even though CBOR attempts to minimize these cases, not all well- 1156 formed CBOR data is valid: for example, the format excludes simple 1157 values below 32 that are encoded with an extension byte. Also, 1158 specific tags may impose semantic constraints that may be violated, 1159 such as by enclosing a tag in a bignum tag or by enclosing a byte 1160 string in a date tag. Finally, the data may be invalid, such as 1161 invalid UTF-8 strings or date strings that do not conform to 1162 [RFC3339]. There is no requirement that generic encoders and 1163 decoders make unnatural choices for their application interface to 1164 enable the processing of invalid data. Generic encoders and decoders 1165 are expected to forward simple values and tags even if their specific 1166 codepoints are not registered at the time the encoder/decoder is 1167 written (Section 4.5). 1169 Generic decoders provide ways to present well-formed CBOR values, 1170 both valid and invalid, to an application. The diagnostic notation 1171 (Section 7) may be used to present well-formed CBOR values to humans. 1173 Generic encoders provide an application interface that allows the 1174 application to specify any well-formed value, including simple values 1175 and tags unknown to the encoder. 1177 4.3. Syntax Errors 1179 A decoder encountering a CBOR data item that is not well-formed 1180 generally can choose to completely fail the decoding (issue an error 1181 and/or stop processing altogether), substitute the problematic data 1182 and data items using a decoder-specific convention that clearly 1183 indicates there has been a problem, or take some other action. 1185 4.3.1.
Incomplete CBOR Data Items 1187 The representation of a CBOR data item has a specific length, 1188 determined by its initial bytes and by the structure of any data 1189 items enclosed in the data items. If less data is available, this 1190 can be treated as a syntax error. A decoder may also implement 1191 incremental parsing, that is, decode the data item as far as it is 1192 available and present the data found so far (such as in an event- 1193 based interface), with the option of continuing the decoding once 1194 further data is available. 1196 Examples of incomplete data items include: 1198 o A decoder expects a certain number of array or map entries but 1199 instead encounters the end of the data. 1201 o A decoder processes what it expects to be the last pair in a map 1202 and comes to the end of the data. 1204 o A decoder has just seen a tag and then encounters the end of the 1205 data. 1207 o A decoder has seen the beginning of an indefinite-length item but 1208 encounters the end of the data before it sees the "break" stop 1209 code. 1211 4.3.2. Malformed Indefinite-Length Items 1213 Examples of malformed indefinite-length data items include: 1215 o Within an indefinite-length byte string or text, a decoder finds 1216 an item that is not of the appropriate major type before it finds 1217 the "break" stop code. 1219 o Within an indefinite-length map, a decoder encounters the "break" 1220 stop code immediately after reading a key (the value is missing). 1222 Another error is finding a "break" stop code at a point in the data 1223 where there is no immediately enclosing (unclosed) indefinite-length 1224 item. 1226 4.3.3. Unknown Additional Information Values 1228 At the time of writing, some additional information values are 1229 unassigned and reserved for future versions of this document (see 1230 Section 6.2). 
Since the overall syntax for these additional 1231 information values is not yet defined, a decoder that sees an 1232 additional information value that it does not understand cannot 1233 continue parsing. 1235 4.4. Other Decoding Errors 1237 A CBOR data item may be syntactically well-formed but present a 1238 problem with interpreting the data encoded in it in the CBOR data 1239 model. Generally speaking, a decoder that finds a data item with 1240 such a problem might issue a warning, might stop processing 1241 altogether, might handle the error and make the problematic value 1242 available to the application as such, or take some other type of 1243 action. 1245 Such problems might include: 1247 Duplicate keys in a map: Generic decoders (Section 4.2) make data 1248 available to applications using the native CBOR data model. That 1249 data model includes maps (key-value mappings with unique keys), 1250 not multimaps (key-value mappings where multiple entries can have 1251 the same key). Thus, a generic decoder that gets a CBOR map item 1252 that has duplicate keys will decode to a map with only one 1253 instance of that key, or it might stop processing altogether. On 1254 the other hand, a "streaming decoder" may not even be able to 1255 notice (Section 4.7). 1257 Inadmissible type on the value following a tag: Tags (Section 3.4) 1258 specify what type of data item is supposed to follow the tag; for 1259 example, the tags for positive or negative bignums are supposed to 1260 be put on byte strings. A decoder that decodes the tagged data 1261 item into a native representation (a native big integer in this 1262 example) is expected to check the type of the data item being 1263 tagged. Even decoders that don't have such native representations 1264 available in their environment may perform the check on those tags 1265 known to them and react appropriately. 
1267 Invalid UTF-8 string: A decoder might or might not want to verify 1268 that the sequence of bytes in a UTF-8 string (major type 3) is 1269 actually valid UTF-8 and react appropriately. 1271 4.5. Handling Unknown Simple Values and Tags 1273 A decoder that comes across a simple value (Section 3.3) that it does 1274 not recognize, such as a value that was added to the IANA registry 1275 after the decoder was deployed or a value that the decoder chose not 1276 to implement, might issue a warning, might stop processing 1277 altogether, might handle the error by making the unknown value 1278 available to the application as such (as is expected of generic 1279 decoders), or take some other type of action. 1281 A decoder that comes across a tag (Section 3.4) that it does not 1282 recognize, such as a tag that was added to the IANA registry after 1283 the decoder was deployed or a tag that the decoder chose not to 1284 implement, might issue a warning, might stop processing altogether, 1285 might handle the error and present the unknown tag value together 1286 with the contained data item to the application (as is expected of 1287 generic decoders), might ignore the tag and simply present the 1288 contained data item only to the application, or take some other type 1289 of action. 1291 4.6. Numbers 1293 An application or protocol that uses CBOR might restrict the 1294 representations of numbers. For instance, a protocol that only deals 1295 with integers might say that floating-point numbers may not be used 1296 and that decoders of that protocol do not need to be able to handle 1297 floating-point numbers. Similarly, a protocol or application that 1298 uses CBOR might say that decoders need to be able to handle either 1299 type of number. 1301 CBOR-based protocols should take into account that different language 1302 environments pose different restrictions on the range and precision 1303 of numbers that are representable. 
For example, the JavaScript 1304 number system treats all numbers as floating point, which may result 1305 in silent loss of precision in decoding integers with more than 53 1306 significant bits. A protocol that uses numbers should define its 1307 expectations on the handling of non-trivial numbers in decoders and 1308 receiving applications. 1310 A CBOR-based protocol that includes floating-point numbers can 1311 restrict which of the three formats (half-precision, single- 1312 precision, and double-precision) are to be supported. For an 1313 integer-only application, a protocol may want to completely exclude 1314 the use of floating-point values. 1316 A CBOR-based protocol designed for compactness may want to exclude 1317 specific integer encodings that are longer than necessary for the 1318 application, such as to save the need to implement 64-bit integers. 1319 There is an expectation that encoders will use the most compact 1320 integer representation that can represent a given value. However, a 1321 compact application should accept values that use a longer-than- 1322 needed encoding (such as encoding "0" as 0b000_11001 followed by two 1323 bytes of 0x00) as long as the application can decode an integer of 1324 the given size. 1326 4.7. Specifying Keys for Maps 1328 The encoding and decoding applications need to agree on what types of 1329 keys are going to be used in maps. In applications that need to 1330 interwork with JSON-based applications, keys probably should be 1331 limited to UTF-8 strings only; otherwise, there has to be a specified 1332 mapping from the other CBOR types to Unicode characters, and this 1333 often leads to implementation errors. In applications where keys are 1334 numeric in nature and numeric ordering of keys is important to the 1335 application, directly using the numbers for the keys is useful. 
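Both the compact-integer expectation of Section 4.6 and the use of integers as map keys rest on how major types 0 and 1 encode an integer. The following Python sketch illustrates this; the function names encode_int and decode_int are assumptions made for this example, and only a single integer item is handled. The encoder always picks the shortest form, while the decoder also accepts longer-than-needed forms.

```python
def encode_int(value: int) -> bytes:
    """Encode an integer in the shortest major type 0/1 form (hypothetical helper)."""
    if not -2**64 <= value < 2**64:
        raise ValueError("outside the range of major types 0 and 1")
    # Major type 0 carries value directly; major type 1 carries -1 - value.
    major, arg = (0, value) if value >= 0 else (1, -1 - value)
    if arg < 24:
        return bytes([major << 5 | arg])  # fits in the initial byte
    for info, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * width):
            return bytes([major << 5 | info]) + arg.to_bytes(width, "big")
    raise AssertionError("unreachable after the range check")


def decode_int(data: bytes) -> int:
    """Decode one major type 0/1 item, accepting non-shortest encodings."""
    if not data:
        raise ValueError("no data")
    major, info = data[0] >> 5, data[0] & 0x1F
    if major > 1 or info > 27:
        raise ValueError("not a supported integer encoding")
    if info < 24:
        arg = info
    else:
        width = 1 << (info - 24)  # 1, 2, 4, or 8 following bytes
        if len(data) < 1 + width:
            raise ValueError("integer is truncated")
        arg = int.from_bytes(data[1:1 + width], "big")
    return arg if major == 0 else -1 - arg


shortest = encode_int(27315)                      # three bytes: 19 6a b3
padded_zero = decode_int(bytes.fromhex("190000"))  # longer-than-needed zero
```

In this sketch the integers 0 to 23 and -1 to -24 each occupy a single byte, which is why small integers make compact map keys.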
1337 If multiple types of keys are to be used, consideration should be 1338 given to how these types would be represented in the specific 1339 programming environments that are to be used. For example, in 1340 JavaScript objects, a key of integer 1 cannot be distinguished from a 1341 key of string "1". This means that, if integer keys are used, the 1342 simultaneous use of string keys that look like numbers needs to be 1343 avoided. Again, this leads to the conclusion that keys should be of 1344 a single CBOR type. 1346 Decoders that deliver data items nested within a CBOR data item 1347 immediately on decoding them ("streaming decoders") often do not keep 1348 the state that is necessary to ascertain uniqueness of a key in a 1349 map. Similarly, an encoder that can start encoding data items before 1350 the enclosing data item is completely available ("streaming encoder") 1351 may want to reduce its overhead significantly by relying on its data 1352 source to maintain uniqueness. 1354 A CBOR-based protocol should make an intentional decision about what 1355 to do when a receiving application does see multiple identical keys 1356 in a map. The resulting rule in the protocol should respect the CBOR 1357 data model: it cannot prescribe a specific handling of the entries 1358 with the identical keys, except that it might have a rule that having 1359 identical keys in a map indicates a malformed map and that the 1360 decoder has to stop with an error. Duplicate keys are also 1361 prohibited by CBOR decoders that are using strict mode 1362 (Section 4.10). 1364 The CBOR data model for maps does not allow ascribing semantics to 1365 the order of the key/value pairs in the map representation. Thus, it 1366 would be a very bad practice to define a CBOR-based protocol in such 1367 a way that changing the key/value pair order in a map would change 1368 the semantics, apart from trivial aspects (cache usage, etc.). 
(A 1369 CBOR-based protocol can prescribe a specific order of serialization, 1370 such as for canonicalization.) 1372 Applications for constrained devices that have maps with 24 or fewer 1373 frequently used keys should consider using small integers (and those 1374 with up to 48 frequently used keys should consider also using small 1375 negative integers) because the keys can then be encoded in a single 1376 byte. 1378 4.7.1. Equivalence of Keys 1380 The following notion of equivalence must be used to determine whether keys in 1381 maps are duplicates or distinct. 1383 o All numbers are compared by their numeric value. 1385 * Integer data items with the same value are equal regardless of 1386 how many bytes are used to encode them. 1388 * Floating-point data items with the same value are equal 1389 regardless of how many bytes are used to encode them. 1391 * An integer value encoded as a floating-point data item is 1392 equivalent to the same value encoded as an integer. 1394 o Byte strings and text strings are compared by their binary 1395 content. 1397 * A different length encoding has no effect on equivalence. 1399 * A byte string is equal to a text string if they have the same 1400 binary content. 1402 o Two arrays are equal if their items are equal and in the same 1403 order. 1405 o Two maps are equal if they have the same set of pairs regardless 1406 of their order; pairs are equal if both the key and value are 1407 equal. 1409 o Tags have no effect in determining equality of a data item: if two 1410 items are equal, then they are equal irrespective of any tags that 1411 either or both may have. 1413 o Simple values are equal if they have the same value. 1415 Nothing else is equal: a simple value 2 is not equivalent to an 1416 integer 2, and an array cannot be equivalent to a map with the same 1417 values and sequential integer keys. 1419 4.8.
Undefined Values 1421 In some CBOR-based protocols, the simple value (Section 3.3) of 1422 Undefined might be used by an encoder as a substitute for a data item 1423 with an encoding problem, in order to allow the rest of the enclosing 1424 data items to be encoded without harm. 1426 4.9. Canonical CBOR 1428 Some protocols may want encoders to only emit CBOR in a particular 1429 canonical format; those protocols might also have the decoders check 1430 that their input is canonical. Those protocols are free to define 1431 what they mean by a canonical format and what encoders and decoders 1432 are expected to do. This section defines a set of restrictions that 1433 can serve as the base of such a canonical format. 1435 A CBOR encoding satisfies the "core canonicalization requirements" if 1436 it satisfies the following restrictions: 1438 o Integers MUST be as short as possible. In particular: 1440 * 0 to 23 and -1 to -24 MUST be expressed in the same byte as the 1441 major type; 1443 * 24 to 255 and -25 to -256 MUST be expressed only with an 1444 additional uint8_t; 1446 * 256 to 65535 and -257 to -65536 MUST be expressed only with an 1447 additional uint16_t; 1449 * 65536 to 4294967295 and -65537 to -4294967296 MUST be expressed 1450 only with an additional uint32_t. 1452 o The expression of lengths in major types 2 through 5 MUST be as 1453 short as possible. The rules for these lengths follow the above 1454 rule for integers. 1456 o The keys in every map MUST be sorted in the bytewise lexicographic 1457 order of their canonical encodings. For example, the following 1458 keys are sorted correctly: 1460 1. 10, encoded as 0x0a. 1462 2. 100, encoded as 0x1864. 1464 3. -1, encoded as 0x20. 1466 4. "z", encoded as 0x617a. 1468 5. "aa", encoded as 0x626161. 1470 6. [100], encoded as 0x811864. 1472 7. [-1], encoded as 0x8120. 1474 8. false, encoded as 0xf4. 1476 o Indefinite-length items MUST NOT appear. They can be encoded as 1477 definite-length items instead.
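The core canonicalization requirements listed above can be sketched in a few lines of Python. This is a minimal, non-normative illustration that covers only integers, text strings, arrays, and the simple values false and true. The function names encode_head, encode, and sort_keys are hypothetical and not defined by this specification.

```python
def encode_head(major: int, arg: int) -> bytes:
    """Encode a major type and its argument in the shortest possible form."""
    if arg < 24:
        # Argument fits in the same byte as the major type.
        return bytes([major << 5 | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([major << 5 | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode(item) -> bytes:
    """Encode a small subset of data items with definite lengths only."""
    if item is False:
        return b"\xf4"
    if item is True:
        return b"\xf5"
    if isinstance(item, int):
        # Major type 0 for non-negative, major type 1 for negative integers.
        return encode_head(0, item) if item >= 0 else encode_head(1, -1 - item)
    if isinstance(item, str):
        raw = item.encode("utf-8")
        return encode_head(3, len(raw)) + raw
    if isinstance(item, list):
        return encode_head(4, len(item)) + b"".join(encode(x) for x in item)
    raise TypeError("type not covered by this sketch")


def sort_keys(keys):
    """Sort keys in the bytewise lexicographic order of their encodings."""
    return sorted(keys, key=encode)


keys = [False, [-1], "aa", 100, "z", [100], -1, 10]
for key in sort_keys(keys):
    print(repr(key), "0x" + encode(key).hex())
```

Python compares bytes objects bytewise and lexicographically, so sorting on the encoded form yields the order required by the third restriction. A full encoder would also need byte strings, maps, tags, and floating-point values.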
1479 If a protocol allows for IEEE floats, then additional 1480 canonicalization rules might need to be added. One example rule 1481 might be to have all floats start as a 64-bit float, then do a test 1482 conversion to a 32-bit float; if the result is the same numeric 1483 value, use the shorter value and repeat the process with a test 1484 conversion to a 16-bit float. (This rule selects 16-bit float for 1485 positive and negative Infinity as well.) Also, there are many 1486 representations for NaN. If NaN is an allowed value, it must always 1487 be represented as 0xf97e00. 1489 CBOR tags present additional considerations for canonicalization. 1490 The absence or presence of tags in a canonical format is determined 1491 by the optionality of the tags in the protocol. In a CBOR-based 1492 protocol that allows optional tagging anywhere, the canonical format 1493 must not allow them. In a protocol that requires tags in certain 1494 places, the tag needs to appear in the canonical format. A CBOR- 1495 based protocol that uses canonicalization might instead say that all 1496 tags that appear in a message must be retained regardless of whether 1497 they are optional. 1499 Protocols that include floating, big integer, or other complex values 1500 need to define extra requirements on their canonical encodings. For 1501 example: 1503 o If a protocol includes a field that can express floating values 1504 (Section 3.3), the protocol's canonicalization needs to specify 1505 whether the integer 1.0 is encoded as 0x01, 0xf93c00, 1506 0xfa3f800000, or 0xfb3ff0000000000000. Three sensible rules for 1507 this are: 1509 1. Encode integral values that fit in 64 bits as values from 1510 major types 0 and 1, and other values as the smallest of 16-, 1511 32-, or 64-bit floating point that accurately represents the 1512 value, 1514 2. 
Encode all values as the smallest of 16-, 32-, or 64-bit 1515 floating point that accurately represents the value, even for 1516 integral values, or 1518 3. Encode all values as 64-bit floating point. 1520 If NaN is an allowed value, the protocol needs to pick a single 1521 representation, for example 0xf97e00. 1523 o If a protocol includes a field that can express integers larger 1524 than 2^64 using tag 2 (Section 3.4.2), the protocol's 1525 canonicalization needs to specify whether small integers are 1526 expressed using the tag or major types 0 and 1. 1528 o A protocol might give encoders the choice of representing a URL as 1529 either a text string or, using Section 3.4.4.3, tag 32 containing 1530 a text string. This protocol's canonicalization needs to either 1531 require that the tag is present or require that it's absent, not 1532 allow either one. 1534 4.9.1. Length-first map key ordering 1536 The core canonicalization requirements sort map keys in a different 1537 order from the one suggested by [RFC7049]. Protocols that need to be 1538 compatible with [RFC7049]'s order can instead be specified in terms 1539 of this specification's "length-first core canonicalization 1540 requirements": 1542 A CBOR encoding satisfies the "length-first core canonicalization 1543 requirements" if it satisfies the core canonicalization requirements 1544 except that the keys in every map MUST be sorted such that: 1546 1. If two keys have different lengths, the shorter one sorts 1547 earlier; 1549 2. If two keys have the same length, the one with the lower value in 1550 (byte-wise) lexical order sorts earlier. 1552 For example, under the length-first core canonicalization 1553 requirements, the following keys are sorted correctly: 1555 1. 10, encoded as 0x0a. 1557 2. -1, encoded as 0x20. 1559 3. false, encoded as 0xf4. 1561 4. 100, encoded as 0x1864. 1563 5. "z", encoded as 0x617a. 1565 6. [-1], encoded as 0x8120. 1567 7. "aa", encoded as 0x626161. 1569 8. 
[100], encoded as 0x811864. 1571 4.10. Strict Mode 1573 Some areas of application of CBOR do not require canonicalization 1574 (Section 4.9) but may require that different decoders reach the same 1575 (semantically equivalent) results, even in the presence of 1576 potentially malicious data. This can be required if one application 1577 (such as a firewall or other protecting entity) makes a decision 1578 based on the data that another application, which independently 1579 decodes the data, relies on. 1581 Normally, it is the responsibility of the sender to avoid ambiguously 1582 decodable data. However, the sender might be an attacker specially 1583 making up CBOR data such that it will be interpreted differently by 1584 different decoders in an attempt to exploit that as a vulnerability. 1585 Generic decoders used in applications where this might be a problem 1586 need to support a strict mode in which it is also the responsibility 1587 of the receiver to reject ambiguously decodable data. It is expected 1588 that firewalls and other security systems that decode CBOR will only 1589 decode in strict mode. 1591 A decoder in strict mode will reliably reject any data that could be 1592 interpreted by other decoders in different ways. It will reliably 1593 reject data items with syntax errors (Section 4.3). It will also 1594 expend the effort to reliably detect other decoding errors 1595 (Section 4.4). 
In particular, a strict decoder needs to have an API 1596 that reports an error (and does not return data) for a CBOR data item 1597 that contains any of the following: 1599 o a map (major type 5) that has more than one entry with the same 1600 key 1602 o a tag that is used on a data item of the incorrect type 1604 o a data item that is incorrectly formatted for the type given to 1605 it, such as invalid UTF-8 or data that cannot be interpreted with 1606 the specific tag that it has been tagged with 1608 A decoder in strict mode can do one of two things when it encounters 1609 a tag or simple value that it does not recognize: 1611 o It can report an error (and not return data). 1613 o It can emit the unknown item (type, value, and, for tags, the 1614 decoded tagged data item) to the application calling the decoder 1615 with an indication that the decoder did not recognize that tag or 1616 simple value. 1618 The latter approach, which is also appropriate for non-strict 1619 decoders, supports forward compatibility with newly registered tags 1620 and simple values without the requirement to update the decoder at 1621 the same time as the calling application. (For this, the API for the 1622 decoder needs to have a way to mark unknown items so that the calling 1623 application can handle them in a manner appropriate for the program.) 1625 Since some of this processing may have an appreciable cost (in 1626 particular with duplicate detection for maps), support of strict mode 1627 is not a requirement placed on all CBOR decoders. 1629 Some encoders will rely on their applications to provide input data 1630 in such a way that unambiguously decodable CBOR results. A generic 1631 encoder also may want to provide a strict mode where it reliably 1632 limits its output to unambiguously decodable CBOR, independent of 1633 whether or not its application is providing API-conformant data. 1635 5.
Converting Data between CBOR and JSON 1637 This section gives non-normative advice about converting between CBOR 1638 and JSON. Implementations of converters are free to use whichever 1639 advice here they want. 1641 It is worth noting that a JSON text is a sequence of characters, not 1642 an encoded sequence of bytes, while a CBOR data item consists of 1643 bytes, not characters. 1645 5.1. Converting from CBOR to JSON 1647 Most of the types in CBOR have direct analogs in JSON. However, some 1648 do not, and someone implementing a CBOR-to-JSON converter has to 1649 consider what to do in those cases. The following non-normative 1650 advice deals with these by converting them to a single substitute 1651 value, such as a JSON null. 1653 o An integer (major type 0 or 1) becomes a JSON number. 1655 o A byte string (major type 2) that is not embedded in a tag that 1656 specifies a proposed encoding is encoded in base64url without 1657 padding and becomes a JSON string. 1659 o A UTF-8 string (major type 3) becomes a JSON string. Note that 1660 JSON requires escaping certain characters (RFC 7159, Section 7): 1661 quotation mark (U+0022), reverse solidus (U+005C), and the "C0 1662 control characters" (U+0000 through U+001F). All other characters 1663 are copied unchanged into the JSON UTF-8 string. 1665 o An array (major type 4) becomes a JSON array. 1667 o A map (major type 5) becomes a JSON object. This is possible 1668 directly only if all keys are UTF-8 strings. A converter might 1669 also convert other keys into UTF-8 strings (such as by converting 1670 integers into strings containing their decimal representation); 1671 however, doing so introduces a danger of key collision. 1673 o False (major type 7, additional information 20) becomes a JSON 1674 false. 1676 o True (major type 7, additional information 21) becomes a JSON 1677 true. 1679 o Null (major type 7, additional information 22) becomes a JSON 1680 null. 
1682 o A floating-point value (major type 7, additional information 25 1683 through 27) becomes a JSON number if it is finite (that is, it can 1684 be represented in a JSON number); if the value is non-finite (NaN, 1685 or positive or negative Infinity), it is represented by the 1686 substitute value. 1688 o Any other simple value (major type 7, any additional information 1689 value not yet discussed) is represented by the substitute value. 1691 o A bignum (major type 6, tag value 2 or 3) is represented by 1692 encoding its byte string in base64url without padding and becomes 1693 a JSON string. For tag value 3 (negative bignum), a "~" (ASCII 1694 tilde) is inserted before the base-encoded value. (The conversion 1695 to a binary blob instead of a number is to prevent a likely 1696 numeric overflow for the JSON decoder.) 1698 o A byte string with an encoding hint (major type 6, tag value 21 1699 through 23) is encoded as described and becomes a JSON string. 1701 o For all other tags (major type 6, any other tag value), the 1702 embedded CBOR item is represented as a JSON value; the tag value 1703 is ignored. 1705 o Indefinite-length items are made definite before conversion. 1707 5.2. Converting from JSON to CBOR 1709 All JSON values, once decoded, directly map into one or more CBOR 1710 values. As with any kind of CBOR generation, decisions have to be 1711 made with respect to number representation. In a suggested 1712 conversion: 1714 o JSON numbers without fractional parts (integer numbers) are 1715 represented as integers (major types 0 and 1, possibly major type 1716 6 tag value 2 and 3), choosing the shortest form; integers longer 1717 than an implementation-defined threshold (which is usually either 1718 32 or 64 bits) may instead be represented as floating-point 1719 values. (If the JSON was generated from a JavaScript 1720 implementation, its precision is already limited to 53 bits 1721 maximum.) 
1723 o Numbers with fractional parts are represented as floating-point 1724 values. Preferably, the shortest exact floating-point 1725 representation is used; for instance, 1.5 is represented in a 1726 16-bit floating-point value (not all implementations will be 1727 capable of efficiently finding the minimum form, though). An 1728 implementation-defined limit may restrict the precision of 1729 the represented values. Decimal 1730 representation should only be used if that is specified in a 1731 protocol. 1733 CBOR has been designed to generally provide a more compact encoding 1734 than JSON. One implementation strategy that might come to mind is to 1735 perform a JSON-to-CBOR encoding in place in a single buffer. This 1736 strategy would need to carefully consider a number of pathological 1737 cases; for example, strings that contain no or very few 1738 escapes and are longer (or much longer) than 255 bytes may expand when 1739 encoded as UTF-8 strings in CBOR. Similarly, a few of the binary 1740 floating-point representations might cause expansion from some short 1741 decimal representations (1.1, 1e9) in JSON. This may be hard to get 1742 right, and any ensuing vulnerabilities may be exploited by an 1743 attacker. 1745 6. Future Evolution of CBOR 1747 Successful protocols evolve over time. New ideas appear, 1748 implementation platforms improve, related protocols are developed and 1749 evolve, and new requirements from applications and protocols are 1750 added. Facilitating protocol evolution is therefore an important 1751 design consideration for any protocol development. 1753 For protocols that will use CBOR, CBOR provides some useful 1754 mechanisms to facilitate their evolution. Best practices for this 1755 are well known, particularly from the development of JSON- 1756 based protocols. Therefore, such best practices are outside the 1757 scope of this specification.
1759 However, facilitating the evolution of CBOR itself is very well 1760 within its scope. CBOR is designed to both provide a stable basis 1761 for development of CBOR-based protocols and to be able to evolve. 1762 Since a successful protocol may live for decades, CBOR needs to be 1763 designed for decades of use and evolution. This section provides 1764 some guidance for the evolution of CBOR. It is necessarily more 1765 subjective than other parts of this document. It is also necessarily 1766 incomplete, lest it turn into a textbook on protocol development. 1768 6.1. Extension Points 1770 In a protocol design, opportunities for evolution are often included 1771 in the form of extension points. For example, there may be a 1772 codepoint space that is not fully allocated from the outset, and the 1773 protocol is designed to tolerate and embrace implementations that 1774 start using more codepoints than initially allocated. 1776 Sizing the codepoint space may be difficult because the range 1777 required may be hard to predict. An attempt should be made to make 1778 the codepoint space large enough so that it can slowly be filled over 1779 the intended lifetime of the protocol. 1781 CBOR has three major extension points: 1783 o the "simple" space (values in major type 7). Of the 24 efficient 1784 (and 224 slightly less efficient) values, only a small number have 1785 been allocated. Implementations receiving an unknown simple data 1786 item may be able to process it as such, given that the structure 1787 of the value is indeed simple. The IANA registry in Section 8.1 1788 is the appropriate way to address the extensibility of this 1789 codepoint space. 1791 o the "tag" space (values in major type 6). Again, only a small 1792 part of the codepoint space has been allocated, and the space is 1793 abundant (although the early numbers are more efficient than the 1794 later ones). 
Implementations receiving an unknown tag can choose 1795 to simply ignore it or to process it as an unknown tag wrapping 1796 the following data item. The IANA registry in Section 8.2 is the 1797 appropriate way to address the extensibility of this codepoint 1798 space. 1800 o the "additional information" space. An implementation receiving 1801 an unknown additional information value has no way to continue 1802 parsing, so allocating codepoints to this space is a major step. 1803 There are also very few codepoints left. 1805 6.2. Curating the Additional Information Space 1807 The human mind is sometimes drawn to filling in little perceived gaps 1808 to make something neat. We expect the remaining gaps in the 1809 codepoint space for the additional information values to be an 1810 attractor for new ideas, just because they are there. 1812 The present specification does not manage the additional information 1813 codepoint space by an IANA registry. Instead, allocations out of 1814 this space can only be done by updating this specification. 1816 For an additional information value of n >= 24, the size of the 1817 additional data typically is 2**(n-24) bytes. Therefore, additional 1818 information values 28 and 29 should be viewed as candidates for 1819 128-bit and 256-bit quantities, in case a need arises to add them to 1820 the protocol. Additional information value 30 is then the only 1821 additional information value available for general allocation, and 1822 there should be a very good reason for allocating it before assigning 1823 it through an update of this protocol. 1825 7. Diagnostic Notation 1827 CBOR is a binary interchange format. To facilitate documentation and 1828 debugging, and in particular to facilitate communication between 1829 entities cooperating in debugging, this section defines a simple 1830 human-readable diagnostic notation. All actual interchange always 1831 happens in the binary format. 
1833 Note that this truly is a diagnostic format; it is not meant to be 1834 parsed. Therefore, no formal definition (as in ABNF) is given in 1835 this document. (Implementers looking for a text-based format for 1836 representing CBOR data items in configuration files may also want to 1837 consider YAML [YAML].) 1839 The diagnostic notation is loosely based on JSON as it is defined in 1840 RFC 7159, extending it where needed. 1842 The notation borrows the JSON syntax for numbers (integer and 1843 floating point), True (>true<), False (>false<), Null (>null<), UTF-8 1844 strings, arrays, and maps (maps are called objects in JSON; the 1845 diagnostic notation extends JSON here by allowing any data item in 1846 the key position). Undefined is written >undefined< as in 1847 JavaScript. The non-finite floating-point numbers Infinity, 1848 -Infinity, and NaN are written exactly as in this sentence (this is 1849 also a way they can be written in JavaScript, although JSON does not 1850 allow them). A tagged item is written as an integer number for the 1851 tag followed by the item in parentheses; for instance, an RFC 3339 1852 (ISO 8601) date could be notated as: 1854 0("2013-03-21T20:04:00Z") 1856 or the equivalent relative time as 1858 1(1363896240) 1860 Byte strings are notated in one of the base encodings, without 1861 padding, enclosed in single quotes, prefixed by >h< for base16, >b32< 1862 for base32, >h32< for base32hex, >b64< for base64 or base64url (the 1863 actual encodings do not overlap, so the string remains unambiguous). 1864 For example, the byte string 0x12345678 could be written h'12345678', 1865 b32'CI2FM6A', or b64'EjRWeA'. 1867 Unassigned simple values are given as "simple()" with the appropriate 1868 integer in the parentheses. For example, "simple(42)" indicates 1869 major type 7, value 42. 1871 7.1. 
Encoding Indicators 1873 Sometimes it is useful to indicate in the diagnostic notation which 1874 of several alternative representations were actually used; for 1875 example, a data item written >1.5< by a diagnostic decoder might have 1876 been encoded as a half-, single-, or double-precision float. 1878 The convention for encoding indicators is that anything starting with 1879 an underscore and all following characters that are alphanumeric or 1880 underscore, is an encoding indicator, and can be ignored by anyone 1881 not interested in this information. Encoding indicators are always 1882 optional. 1884 A single underscore can be written after the opening brace of a map 1885 or the opening bracket of an array to indicate that the data item was 1886 represented in indefinite-length format. For example, [_ 1, 2] 1887 contains an indicator that an indefinite-length representation was 1888 used to represent the data item [1, 2]. 1890 An underscore followed by a decimal digit n indicates that the 1891 preceding item (or, for arrays and maps, the item starting with the 1892 preceding bracket or brace) was encoded with an additional 1893 information value of 24+n. For example, 1.5_1 is a half-precision 1894 floating-point number, while 1.5_3 is encoded as double precision. 1895 This encoding indicator is not shown in Appendix A. (Note that the 1896 encoding indicator "_" is thus an abbreviation of the full form "_7", 1897 which is not used.) 1899 As a special case, byte and text strings of indefinite length can be 1900 notated in the form (_ h'0123', h'4567') and (_ "foo", "bar"). 1902 8. IANA Considerations 1904 IANA has created two registries for new CBOR values. The registries 1905 are separate, that is, not under an umbrella registry, and follow the 1906 rules in [RFC5226]. IANA has also assigned a new MIME media type and 1907 an associated Constrained Application Protocol (CoAP) Content-Format 1908 entry. 1910 8.1. 
Simple Values Registry 1912 IANA has created the "Concise Binary Object Representation (CBOR) 1913 Simple Values" registry. The initial values are shown in Table 2. 1915 New entries in the range 0 to 19 are assigned by Standards Action. 1916 It is suggested that these Standards Actions allocate values starting 1917 with the number 16 in order to reserve the lower numbers for 1918 contiguous blocks (if any). 1920 New entries in the range 32 to 255 are assigned by Specification 1921 Required. 1923 8.2. Tags Registry 1925 IANA has created the "Concise Binary Object Representation (CBOR) 1926 Tags" registry. The initial values are shown in Table 3. 1928 New entries in the range 0 to 23 are assigned by Standards Action. 1929 New entries in the range 24 to 255 are assigned by Specification 1930 Required. New entries in the range 256 to 18446744073709551615 are 1931 assigned by First Come First Served. The template for registration 1932 requests is: 1934 o Data item 1936 o Semantics (short form) 1938 In addition, First Come First Served requests should include: 1940 o Point of contact 1942 o Description of semantics (URL) - This description is optional; the 1943 URL can point to something like an Internet-Draft or a web page. 1945 8.3. Media Type ("MIME Type") 1947 The Internet media type [RFC6838] for CBOR data is application/cbor. 1949 Type name: application 1951 Subtype name: cbor 1953 Required parameters: n/a 1955 Optional parameters: n/a 1957 Encoding considerations: binary 1959 Security considerations: See Section 9 of this document 1961 Interoperability considerations: n/a 1963 Published specification: This document 1965 Applications that use this media type: None yet, but it is expected 1966 that this format will be deployed in protocols and applications. 
1968 Additional information: 1969 Magic number(s): n/a 1970 File extension(s): .cbor 1971 Macintosh file type code(s): n/a 1973 Person & email address to contact for further information: 1974 Carsten Bormann 1975 cabo@tzi.org 1977 Intended usage: COMMON 1979 Restrictions on usage: none 1981 Author: 1982 Carsten Bormann 1984 Change controller: 1985 The IESG 1987 8.4. CoAP Content-Format 1989 Media Type: application/cbor 1991 Encoding: - 1993 Id: 60 1995 Reference: [RFCthis] 1997 8.5. The +cbor Structured Syntax Suffix Registration 1999 Name: Concise Binary Object Representation (CBOR) 2001 +suffix: +cbor 2003 References: [RFCthis] 2005 Encoding Considerations: CBOR is a binary format. 2007 Interoperability Considerations: n/a 2008 Fragment Identifier Considerations: 2009 The syntax and semantics of fragment identifiers specified for 2010 +cbor SHOULD be as specified for "application/cbor". (At 2011 publication of this document, there is no fragment identification 2012 syntax defined for "application/cbor".) 2014 The syntax and semantics for fragment identifiers for a specific 2015 "xxx/yyy+cbor" SHOULD be processed as follows: 2017 For cases defined in +cbor, where the fragment identifier resolves 2018 per the +cbor rules, then process as specified in +cbor. 2020 For cases defined in +cbor, where the fragment identifier does 2021 not resolve per the +cbor rules, then process as specified in 2022 "xxx/yyy+cbor". 2024 For cases not defined in +cbor, then process as specified in 2025 "xxx/yyy+cbor". 2027 Security Considerations: See Section 9 of this document 2029 Contact: 2030 Apps Area Working Group (apps-discuss@ietf.org) 2032 Author/Change Controller: 2033 The Apps Area Working Group. 2034 The IESG has change control over this registration. 2036 9. Security Considerations 2038 A network-facing application can exhibit vulnerabilities in its 2039 processing logic for incoming data. 
Complex parsers are well known 2040 as a likely source of such vulnerabilities, such as the ability to 2041 remotely crash a node, or even remotely execute arbitrary code on it. 2042 CBOR attempts to narrow the opportunities for introducing such 2043 vulnerabilities by reducing parser complexity, by giving the entire 2044 range of encodable values a meaning where possible. 2046 Resource exhaustion attacks might attempt to lure a decoder into 2047 allocating very big data items (strings, arrays, maps) or exhaust the 2048 stack depth by setting up deeply nested items. Decoders need to have 2049 appropriate resource management to mitigate these attacks. (Items 2050 for which very large sizes are given can also attempt to exploit 2051 integer overflow vulnerabilities.) 2053 Applications where a CBOR data item is examined by a gatekeeper 2054 function and later used by a different application may exhibit 2055 vulnerabilities when multiple interpretations of the data item are 2056 possible. For example, an attacker could make use of duplicate keys 2057 in maps and precision issues in numbers to make the gatekeeper base 2058 its decisions on a different interpretation than the one that will be 2059 used by the second application. Protocols that are used in a 2060 security context should be defined in such a way that these multiple 2061 interpretations are reliably reduced to a single one. To facilitate 2062 this, encoder and decoder implementations used in such contexts 2063 should provide at least one strict mode of operation (Section 4.10). 2065 10. Acknowledgements 2067 CBOR was inspired by MessagePack. MessagePack was developed and 2068 promoted by Sadayuki Furuhashi ("frsyuki"). This reference to 2069 MessagePack is solely for attribution; CBOR is not intended as a 2070 version of or replacement for MessagePack, as it has different design 2071 goals and requirements. 
2073 The need for functionality beyond the original MessagePack 2074 Specification became obvious to many people at about the same time 2075 around the year 2012. BinaryPack is a minor derivation of 2076 MessagePack that was developed by Eric Zhang for the binaryjs 2077 project. A similar, but different, extension was made by Tim Caswell 2078 for his msgpack-js and msgpack-js-browser projects. Many people have 2079 contributed to the recent discussion about extending MessagePack to 2080 separate text string representation from byte string representation. 2082 The encoding of the additional information in CBOR was inspired by 2083 the encoding of length information designed by Klaus Hartke for CoAP. 2085 This document also incorporates suggestions made by many people, 2086 notably Dan Frost, James Manger, Joe Hildebrand, Keith Moore, Matthew 2087 Lepinski, Nico Williams, Phillip Hallam-Baker, Ray Polk, Tim Bray, 2088 Tony Finch, Tony Hansen, and Yaron Sheffer. 2090 11. References 2092 11.1. Normative References 2094 [ECMA262] European Computer Manufacturers Association, "ECMAScript 2095 Language Specification 5.1 Edition", ECMA Standard ECMA- 2096 262, June 2011, . 2100 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 2101 Extensions (MIME) Part One: Format of Internet Message 2102 Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996, 2103 . 2105 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2106 Requirement Levels", BCP 14, RFC 2119, 2107 DOI 10.17487/RFC2119, March 1997, 2108 . 2110 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: 2111 Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002, 2112 . 2114 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 2115 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2116 2003, . 2118 [RFC3986] Berners-Lee, T., Fielding, R., and L. 
Masinter, "Uniform 2119 Resource Identifier (URI): Generic Syntax", STD 66, 2120 RFC 3986, DOI 10.17487/RFC3986, January 2005, 2121 . 2123 [RFC4287] Nottingham, M., Ed. and R. Sayre, Ed., "The Atom 2124 Syndication Format", RFC 4287, DOI 10.17487/RFC4287, 2125 December 2005, . 2127 [RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data 2128 Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006, 2129 . 2131 [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an 2132 IANA Considerations Section in RFCs", RFC 5226, 2133 DOI 10.17487/RFC5226, May 2008, 2134 . 2136 [TIME_T] The Open Group Base Specifications, "Vol. 1: Base 2137 Definitions, Issue 7", Section 4.15 'Seconds Since the 2138 Epoch', IEEE Std 1003.1, 2013 Edition, 2013, 2139 . 2142 11.2. Informative References 2144 [ASN.1] International Telecommunication Union, "Information 2145 Technology -- ASN.1 encoding rules: Specification of Basic 2146 Encoding Rules (BER), Canonical Encoding Rules (CER) and 2147 Distinguished Encoding Rules (DER)", ITU-T Recommendation 2148 X.690, 1994. 2150 [BSON] Various, "BSON - Binary JSON", 2013, 2151 . 2153 [MessagePack] 2154 Furuhashi, S., "MessagePack", 2013, . 2156 [RFC0713] Haverty, J., "MSDTP-Message Services Data Transmission 2157 Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976, 2158 . 2160 [RFC6838] Freed, N., Klensin, J., and T. Hansen, "Media Type 2161 Specifications and Registration Procedures", BCP 13, 2162 RFC 6838, DOI 10.17487/RFC6838, January 2013, 2163 . 2165 [RFC7049] Bormann, C. and P. Hoffman, "Concise Binary Object 2166 Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049, 2167 October 2013, . 2169 [RFC7159] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data 2170 Interchange Format", RFC 7159, DOI 10.17487/RFC7159, March 2171 2014, . 2173 [RFC7228] Bormann, C., Ersue, M., and A. Keranen, "Terminology for 2174 Constrained-Node Networks", RFC 7228, 2175 DOI 10.17487/RFC7228, May 2014, 2176 . 
2178 [UBJSON] The Buzz Media, "Universal Binary JSON Specification", 2179 2013, . 2181 [YAML] Ben-Kiki, O., Evans, C., and I. Net, "YAML Ain't Markup 2182 Language (YAML[TM]) Version 1.2", 3rd Edition, October 2183 2009, . 2185 Appendix A. Examples 2187 The following table provides some CBOR-encoded values in hexadecimal 2188 (right column), together with diagnostic notation for these values 2189 (left column). Note that the string "\u00fc" is one form of 2190 diagnostic notation for a UTF-8 string containing the single Unicode 2191 character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (u umlaut). 2192 Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a 2193 single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, often 2194 representing "water"), and "\ud800\udd51" is a UTF-8 string in 2195 diagnostic notation with a single character U+10151 (GREEK ACROPHONIC 2196 ATTIC FIFTY STATERS). (Note that all these single-character strings 2197 could also be represented in native UTF-8 in diagnostic notation, 2198 just not in an ASCII-only specification like the present one.) In 2199 the diagnostic notation provided for bignums, their intended numeric 2200 value is shown as a decimal number (such as 18446744073709551616) 2201 instead of showing a tagged byte string (such as 2202 2(h'010000000000000000')). 
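The escape and bignum conventions described above can be checked with a short Python sketch before turning to the table. It is an illustration only, not part of the specification: the helper names encode_text and encode_bignum are hypothetical, both helpers handle only payloads shorter than 24 bytes, and Python's json module is assumed as a stand-in for the escape handling of diagnostic notation.

```python
import json

# The escaped surrogate pair decodes to the single character U+10151.
s = json.loads('"\\ud800\\udd51"')
assert len(s) == 1 and ord(s) == 0x10151


def encode_text(text):
    """Encode a text string (major type 3) with fewer than 24 UTF-8 bytes."""
    data = text.encode("utf-8")
    assert len(data) < 24, "sketch covers the short form only"
    return bytes([0x60 + len(data)]) + data


def encode_bignum(n):
    """Encode non-negative n as tag 2 around a byte string under 24 bytes."""
    data = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert len(data) < 24, "sketch covers the short form only"
    return bytes([0xC2, 0x40 + len(data)]) + data


print(encode_text("\u00fc").hex())   # 62c3bc
print(encode_text("\u6c34").hex())   # 63e6b0b4
print(encode_text(s).hex())          # 64f0908591
print(encode_bignum(2**64).hex())    # c249010000000000000000
```

The printed values match the corresponding "Encoded" entries of the table, which is why the diagnostic column can show 18446744073709551616 as a plain decimal number even though the encoding is a tagged byte string.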
2204 +------------------------------+------------------------------------+ 2205 | Diagnostic | Encoded | 2206 +------------------------------+------------------------------------+ 2207 | 0 | 0x00 | 2208 | | | 2209 | 1 | 0x01 | 2210 | | | 2211 | 10 | 0x0a | 2212 | | | 2213 | 23 | 0x17 | 2214 | | | 2215 | 24 | 0x1818 | 2216 | | | 2217 | 25 | 0x1819 | 2218 | | | 2219 | 100 | 0x1864 | 2220 | | | 2221 | 1000 | 0x1903e8 | 2222 | | | 2223 | 1000000 | 0x1a000f4240 | 2224 | | | 2225 | 1000000000000 | 0x1b000000e8d4a51000 | 2226 | | | 2227 | 18446744073709551615 | 0x1bffffffffffffffff | 2228 | | | 2229 | 18446744073709551616 | 0xc249010000000000000000 | 2230 | | | 2231 | -18446744073709551616 | 0x3bffffffffffffffff | 2232 | | | 2233 | -18446744073709551617 | 0xc349010000000000000000 | 2234 | | | 2235 | -1 | 0x20 | 2236 | | | 2237 | -10 | 0x29 | 2238 | | | 2239 | -100 | 0x3863 | 2240 | | | 2241 | -1000 | 0x3903e7 | 2242 | | | 2243 | 0.0 | 0xf90000 | 2244 | | | 2245 | -0.0 | 0xf98000 | 2246 | | | 2247 | 1.0 | 0xf93c00 | 2248 | | | 2249 | 1.1 | 0xfb3ff199999999999a | 2250 | | | 2251 | 1.5 | 0xf93e00 | 2252 | | | 2253 | 65504.0 | 0xf97bff | 2254 | | | 2255 | 100000.0 | 0xfa47c35000 | 2256 | | | 2257 | 3.4028234663852886e+38 | 0xfa7f7fffff | 2258 | | | 2259 | 1.0e+300 | 0xfb7e37e43c8800759c | 2260 | | | 2261 | 5.960464477539063e-8 | 0xf90001 | 2262 | | | 2263 | 0.00006103515625 | 0xf90400 | 2264 | | | 2265 | -4.0 | 0xf9c400 | 2266 | | | 2267 | -4.1 | 0xfbc010666666666666 | 2268 | | | 2269 | Infinity | 0xf97c00 | 2270 | | | 2271 | NaN | 0xf97e00 | 2272 | | | 2273 | -Infinity | 0xf9fc00 | 2274 | | | 2275 | Infinity | 0xfa7f800000 | 2276 | | | 2277 | NaN | 0xfa7fc00000 | 2278 | | | 2279 | -Infinity | 0xfaff800000 | 2280 | | | 2281 | Infinity | 0xfb7ff0000000000000 | 2282 | | | 2283 | NaN | 0xfb7ff8000000000000 | 2284 | | | 2285 | -Infinity | 0xfbfff0000000000000 | 2286 | | | 2287 | false | 0xf4 | 2288 | | | 2289 | true | 0xf5 | 2290 | | | 2291 | null | 0xf6 | 2292 | | | 2293 | 
undefined | 0xf7 | 2294 | | | 2295 | simple(16) | 0xf0 | 2296 | | | 2297 | simple(24) | 0xf818 | 2298 | | | 2299 | simple(255) | 0xf8ff | 2300 | | | 2301 | 0("2013-03-21T20:04:00Z") | 0xc074323031332d30332d32315432303a | 2302 | | 30343a30305a | 2303 | | | 2304 | 1(1363896240) | 0xc11a514b67b0 | 2305 | | | 2306 | 1(1363896240.5) | 0xc1fb41d452d9ec200000 | 2307 | | | 2308 | 23(h'01020304') | 0xd74401020304 | 2309 | | | 2310 | 24(h'6449455446') | 0xd818456449455446 | 2311 | | | 2312 | 32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 | 2313 | | 616d706c652e636f6d | 2314 | | | 2315 | h'' | 0x40 | 2316 | | | 2317 | h'01020304' | 0x4401020304 | 2318 | | | 2319 | "" | 0x60 | 2320 | | | 2321 | "a" | 0x6161 | 2322 | | | 2323 | "IETF" | 0x6449455446 | 2324 | | | 2325 | "\"\\" | 0x62225c | 2326 | | | 2327 | "\u00fc" | 0x62c3bc | 2328 | | | 2329 | "\u6c34" | 0x63e6b0b4 | 2330 | | | 2331 | "\ud800\udd51" | 0x64f0908591 | 2332 | | | 2333 | [] | 0x80 | 2334 | | | 2335 | [1, 2, 3] | 0x83010203 | 2336 | | | 2337 | [1, [2, 3], [4, 5]] | 0x8301820203820405 | 2338 | | | 2339 | [1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x98190102030405060708090a0b0c0d0e | 2340 | 10, 11, 12, 13, 14, 15, 16, | 0f101112131415161718181819 | 2341 | 17, 18, 19, 20, 21, 22, 23, | | 2342 | 24, 25] | | 2343 | | | 2344 | {} | 0xa0 | 2345 | | | 2346 | {1: 2, 3: 4} | 0xa201020304 | 2347 | | | 2348 | {"a": 1, "b": [2, 3]} | 0xa26161016162820203 | 2349 | | | 2350 | ["a", {"b": "c"}] | 0x826161a161626163 | 2351 | | | 2352 | {"a": "A", "b": "B", "c": | 0xa5616161416162614261636143616461 | 2353 | "C", "d": "D", "e": "E"} | 4461656145 | 2354 | | | 2355 | (_ h'0102', h'030405') | 0x5f42010243030405ff | 2356 | | | 2357 | (_ "strea", "ming") | 0x7f657374726561646d696e67ff | 2358 | | | 2359 | [_ ] | 0x9fff | 2360 | | | 2361 | [_ 1, [2, 3], [_ 4, 5]] | 0x9f018202039f0405ffff | 2362 | | | 2363 | [_ 1, [2, 3], [4, 5]] | 0x9f01820203820405ff | 2364 | | | 2365 | [1, [2, 3], [_ 4, 5]] | 0x83018202039f0405ff | 2366 | | | 2367 
| [1, [_ 2, 3], [4, 5]] | 0x83019f0203ff820405 | 2368 | | | 2369 | [_ 1, 2, 3, 4, 5, 6, 7, 8, | 0x9f0102030405060708090a0b0c0d0e0f | 2370 | 9, 10, 11, 12, 13, 14, 15, | 101112131415161718181819ff | 2371 | 16, 17, 18, 19, 20, 21, 22, | | 2372 | 23, 24, 25] | | 2373 | | | 2374 | {_ "a": 1, "b": [_ 2, 3]} | 0xbf61610161629f0203ffff | 2375 | | | 2376 | ["a", {_ "b": "c"}] | 0x826161bf61626163ff | 2377 | | | 2378 | {_ "Fun": true, "Amt": -2} | 0xbf6346756ef563416d7421ff | 2379 +------------------------------+------------------------------------+ 2381 Table 4: Examples of Encoded CBOR Data Items 2383 Appendix B. Jump Table 2385 For brevity, this jump table does not show initial bytes that are 2386 reserved for future extension. It also only shows a selection of the 2387 initial bytes that can be used for optional features. (All unsigned 2388 integers are in network byte order.) 2390 +------------+------------------------------------------------------+ 2391 | Byte | Structure/Semantics | 2392 +------------+------------------------------------------------------+ 2393 | 0x00..0x17 | Integer 0x00..0x17 (0..23) | 2394 | | | 2395 | 0x18 | Unsigned integer (one-byte uint8_t follows) | 2396 | | | 2397 | 0x19 | Unsigned integer (two-byte uint16_t follows) | 2398 | | | 2399 | 0x1a | Unsigned integer (four-byte uint32_t follows) | 2400 | | | 2401 | 0x1b | Unsigned integer (eight-byte uint64_t follows) | 2402 | | | 2403 | 0x20..0x37 | Negative integer -1-0x00..-1-0x17 (-1..-24) | 2404 | | | 2405 | 0x38 | Negative integer -1-n (one-byte uint8_t for n | 2406 | | follows) | 2407 | | | 2408 | 0x39 | Negative integer -1-n (two-byte uint16_t for n | 2409 | | follows) | 2410 | | | 2411 | 0x3a | Negative integer -1-n (four-byte uint32_t for n | 2412 | | follows) | 2413 | | | 2414 | 0x3b | Negative integer -1-n (eight-byte uint64_t for n | 2415 | | follows) | 2416 | | | 2417 | 0x40..0x57 | byte string (0x00..0x17 bytes follow) | 2418 | | | 2419 | 0x58 | byte string (one-byte uint8_t for n, 
and then n | 2420 | | bytes follow) | 2421 | | | 2422 | 0x59 | byte string (two-byte uint16_t for n, and then n | 2423 | | bytes follow) | 2424 | | | 2425 | 0x5a | byte string (four-byte uint32_t for n, and then n | 2426 | | bytes follow) | 2427 | | | 2428 | 0x5b | byte string (eight-byte uint64_t for n, and then n | 2429 | | bytes follow) | 2430 | | | 2431 | 0x5f | byte string, byte strings follow, terminated by | 2432 | | "break" | 2433 | | | 2434 | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow) | 2435 | | | 2436 | 0x78 | UTF-8 string (one-byte uint8_t for n, and then n | 2437 | | bytes follow) | 2438 | | | 2439 | 0x79 | UTF-8 string (two-byte uint16_t for n, and then n | 2440 | | bytes follow) | 2441 | | | 2442 | 0x7a | UTF-8 string (four-byte uint32_t for n, and then n | 2443 | | bytes follow) | 2444 | | | 2445 | 0x7b | UTF-8 string (eight-byte uint64_t for n, and then n | 2446 | | bytes follow) | 2447 | | | 2448 | 0x7f | UTF-8 string, UTF-8 strings follow, terminated by | 2449 | | "break" | 2450 | | | 2451 | 0x80..0x97 | array (0x00..0x17 data items follow) | 2452 | | | 2453 | 0x98 | array (one-byte uint8_t for n, and then n data items | 2454 | | follow) | 2455 | | | 2456 | 0x99 | array (two-byte uint16_t for n, and then n data | 2457 | | items follow) | 2458 | | | 2459 | 0x9a | array (four-byte uint32_t for n, and then n data | 2460 | | items follow) | 2461 | | | 2462 | 0x9b | array (eight-byte uint64_t for n, and then n data | 2463 | | items follow) | 2464 | | | 2465 | 0x9f | array, data items follow, terminated by "break" | 2466 | | | 2467 | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow) | 2468 | | | 2469 | 0xb8 | map (one-byte uint8_t for n, and then n pairs of | 2470 | | data items follow) | 2471 | | | 2472 | 0xb9 | map (two-byte uint16_t for n, and then n pairs of | 2473 | | data items follow) | 2474 | | | 2475 | 0xba | map (four-byte uint32_t for n, and then n pairs of | 2476 | | data items follow) | 2477 | | | 2478 | 0xbb | map 
(eight-byte uint64_t for n, and then n pairs of | 2479 | | data items follow) | 2480 | | | 2481 | 0xbf | map, pairs of data items follow, terminated by | 2482 | | "break" | 2483 | | | 2484 | 0xc0 | Text-based date/time (data item follows; see | 2485 | | Section 3.4.1) | 2486 | | | 2487 | 0xc1 | Epoch-based date/time (data item follows; see | 2488 | | Section 3.4.1) | 2489 | | | 2490 | 0xc2 | Positive bignum (data item "byte string" follows) | 2491 | | | 2492 | 0xc3 | Negative bignum (data item "byte string" follows) | 2493 | | | 2494 | 0xc4 | Decimal Fraction (data item "array" follows; see | 2495 | | Section 3.4.3) | 2496 | | | 2497 | 0xc5 | Bigfloat (data item "array" follows; see | 2498 | | Section 3.4.3) | 2499 | | | 2500 | 0xc6..0xd4 | (tagged item) | 2501 | | | 2502 | 0xd5..0xd7 | Expected Conversion (data item follows; see | 2503 | | Section 3.4.4.2) | 2504 | | | 2505 | 0xd8..0xdb | (more tagged items, 1/2/4/8 bytes and then a data | 2506 | | item follow) | 2507 | | | 2508 | 0xe0..0xf3 | (simple value) | 2509 | | | 2510 | 0xf4 | False | 2511 | | | 2512 | 0xf5 | True | 2513 | | | 2514 | 0xf6 | Null | 2515 | | | 2516 | 0xf7 | Undefined | 2517 | | | 2518 | 0xf8 | (simple value, one byte follows) | 2519 | | | 2520 | 0xf9 | Half-Precision Float (two-byte IEEE 754) | 2521 | | | 2522 | 0xfa | Single-Precision Float (four-byte IEEE 754) | 2523 | | | 2524 | 0xfb | Double-Precision Float (eight-byte IEEE 754) | 2525 | | | 2526 | 0xff | "break" stop code | 2527 +------------+------------------------------------------------------+ 2529 Table 5: Jump Table for Initial Byte 2531 Appendix C. Pseudocode 2533 The well-formedness of a CBOR item can be checked by the pseudocode 2534 in Figure 1. 
The data is well-formed if and only if:

2536    o  the pseudocode does not "fail";

2538    o  after execution of the pseudocode, no bytes are left in the input
2539       (except in streaming applications).

2541    The pseudocode has the following prerequisites:

2543    o  take(n) reads n bytes from the input data and returns them as a
2544       byte string.  If n bytes are no longer available, take(n) fails.

2546    o  uint() converts a byte string into an unsigned integer by
2547       interpreting the byte string in network byte order.

2549    o  Arithmetic works as in C.

2551    o  All variables are unsigned integers of sufficient range.

2553    well_formed (breakable = false) {
2554      // process initial bytes
2555      ib = uint(take(1));
2556      mt = ib >> 5;
2557      val = ai = ib & 0x1f;
2558      switch (ai) {
2559        case 24: val = uint(take(1)); break;
2560        case 25: val = uint(take(2)); break;
2561        case 26: val = uint(take(4)); break;
2562        case 27: val = uint(take(8)); break;
2563        case 28: case 29: case 30: fail();
2564        case 31:
2565          return well_formed_indefinite(mt, breakable);
2566      }
2567      // process content
2568      switch (mt) {
2569        // case 0, 1, 7 do not have content; just use val
2570        case 2: case 3: take(val); break; // bytes/UTF-8
2571        case 4: for (i = 0; i < val; i++) well_formed(); break;
2572        case 5: for (i = 0; i < val*2; i++) well_formed(); break;
2573        case 6: well_formed(); break;     // 1 embedded data item
2574      }
2575      return mt;                    // finite data item
2576    }

2578    well_formed_indefinite(mt, breakable) {
2579      switch (mt) {
2580        case 2: case 3:
2581          while ((it = well_formed(true)) != -1)
2582            if (it != mt)           // need finite embedded
2583              fail();               //    of same type
2584          break;
2585        case 4: while (well_formed(true) != -1); break;
2586        case 5: while (well_formed(true) != -1) well_formed(); break;
2587        case 7:
2588          if (breakable)
2589            return -1;              // signal break out
2590          else fail();              // no enclosing indefinite
2591        default: fail();            // wrong mt
2592      }
2593      return 0;                     // no break out
2594    }

2596             Figure 1: Pseudocode for Well-Formedness Check

2598    Note that the remaining complexity of a complete CBOR decoder is
2599    about presenting data that has been parsed to the application in an
2600    appropriate form.

2602    Major types 0 and 1 are designed in such a way that they can be
2603    encoded in C from a signed integer without actually doing an
2604    if-then-else for positive/negative (Figure 2).  This uses the fact that
2605    (-1-n), the transformation for major type 1, is the same as ~n
2606    (bitwise complement) in C unsigned arithmetic; ~n can then be
2607    expressed as (-1)^n for the negative case, while 0^n leaves n
2608    unchanged for non-negative.  The sign of a number can be converted to
2609    -1 for negative and 0 for non-negative (0 or positive) by
2610    arithmetic-shifting the number by one bit less than the bit length of
2611    the number (for example, by 63 for 64-bit numbers).

2613    void encode_sint(int64_t n) {
2614      uint64_t ui = n >> 63;    // extend sign to whole length
2615      mt = ui & 0x20;           // extract major type
2616      ui ^= n;                  // complement negatives
2617      if (ui < 24)
2618        *p++ = mt + ui;
2619      else if (ui < 256) {
2620        *p++ = mt + 24;
2621        *p++ = ui;
2622      } else
2623           ...

2625           Figure 2: Pseudocode for Encoding a Signed Integer

2627 Appendix D.  Half-Precision

2629    As half-precision floating-point numbers were only added to IEEE 754
2630    in 2008, today's programming platforms often still only have limited
2631    support for them.  It is very easy to include at least decoding
2632    support for them even without such platform support.  An example of a
2633    small decoder for half-precision floating-point numbers in the C
2634    language is shown in Figure 3.  A similar program for Python is in
2635    Figure 4; this code assumes that the 2-byte value has already been
2636    decoded as an (unsigned short) integer in network byte order (as would
2637    be done by the pseudocode in Appendix C).
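Independently of Figures 3 and 4, a decoder of this kind can be cross-checked on platforms that do offer native half-precision support. The following Python sketch is an illustration under stated assumptions, not part of the specification: it assumes Python 3.6 or later, where the struct module accepts the "e" (binary16) format code, and the names decode_half_bits and native_half are hypothetical. The first function repeats the ldexp-based arithmetic of the C decoder on a 16-bit integer.

```python
import math
import struct


def decode_half_bits(half):
    """Decode a 16-bit unsigned integer holding IEEE 754 binary16 bits."""
    exp = (half >> 10) & 0x1F
    mant = half & 0x3FF
    if exp == 0:
        val = math.ldexp(mant, -24)              # zero and subnormal numbers
    elif exp != 31:
        val = math.ldexp(mant + 1024, exp - 25)  # normal numbers
    else:
        val = math.inf if mant == 0 else math.nan
    return -val if half & 0x8000 else val


def native_half(half):
    """Assumption: struct's "e" format code (Python 3.6+) decodes binary16."""
    return struct.unpack("!e", struct.pack("!H", half))[0]


# Bit patterns taken from the half-precision examples in Appendix A.
for bits in (0x0000, 0x8000, 0x3C00, 0x3E00, 0x7BFF,
             0x0001, 0x0400, 0xC400, 0x7C00, 0xFC00):
    assert decode_half_bits(bits) == native_half(bits)

# NaN never compares equal to itself, so it is checked separately.
assert math.isnan(decode_half_bits(0x7E00))
```

Comparing against the native conversion covers zero, subnormal, normal, and infinite inputs with one loop.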
2639    #include <math.h>

2641    double decode_half(unsigned char *halfp) {
2642      int half = (halfp[0] << 8) + halfp[1];
2643      int exp = (half >> 10) & 0x1f;
2644      int mant = half & 0x3ff;
2645      double val;
2646      if (exp == 0) val = ldexp(mant, -24);
2647      else if (exp != 31) val = ldexp(mant + 1024, exp - 25);
2648      else val = mant == 0 ? INFINITY : NAN;
2649      return half & 0x8000 ? -val : val;
2650    }

2652              Figure 3: C Code for a Half-Precision Decoder

2654    import struct
2655    from math import ldexp

2657    def decode_single(single):
2658        return struct.unpack("!f", struct.pack("!I", single))[0]

2660    def decode_half(half):
2661        valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16
2662        if ((half & 0x7c00) != 0x7c00):
2663            return ldexp(decode_single(valu), 112)
2664        return decode_single(valu | 0x7f800000)

2666            Figure 4: Python Code for a Half-Precision Decoder

2668 Appendix E.  Comparison of Other Binary Formats to CBOR's Design
2669              Objectives

2671    The proposal for CBOR follows a history of binary formats that is as
2672    long as the history of computers themselves.  Different formats have
2673    had different objectives.  In most cases, the objectives of the
2674    format were never stated, although they can sometimes be inferred from
2675    the context where the format was first used.  Some formats were meant
2676    to be universally usable, although history has proven that no binary
2677    format meets the needs of all protocols and applications.

2679    CBOR differs from many of these formats in that it starts with a set
2680    of objectives and attempts to meet just those.  This section
2681    compares a few of the dozens of formats with CBOR's objectives in
2682    order to help the reader decide if they want to use CBOR or a
2683    different format for a particular protocol or application.

2685    Note that the discussion here is not meant to be a criticism of any
2686    format: to the best of our knowledge, no format before CBOR was meant
2687    to cover CBOR's objectives in the priority we have assigned them.  A
2688    brief recap of the objectives from Section 1.1 is:

2690    1.  unambiguous encoding of most common data formats from Internet
2691        standards

2693    2.  code compactness for encoder or decoder

2695    3.  no schema description needed

2697    4.  reasonably compact serialization

2699    5.  applicability to constrained and unconstrained applications

2701    6.  good JSON conversion

2703    7.  extensibility

2705 E.1.  ASN.1 DER, BER, and PER

2707    [ASN.1] has many serializations.  In the IETF, DER and BER are the
2708    most common.  The serialized output is not particularly compact for
2709    many items, and the code needed to decode numeric items can be
2710    complex on a constrained device.

2712    Few (if any) IETF protocols have adopted one of the several variants
2713    of Packed Encoding Rules (PER).  There could be many reasons for
2714    this, but one that is commonly stated is that PER makes use of the
2715    schema even for parsing the surface structure of the data stream,
2716    requiring significant tool support.  There are different versions of
2717    the ASN.1 schema language in use, which has also hampered adoption.

2719 E.2.  MessagePack

2721    [MessagePack] is a concise, widely implemented counted binary
2722    serialization format, similar in many properties to CBOR, although
2723    somewhat less regular.  While the data model can be used to represent
2724    JSON data, MessagePack has also been used in many remote procedure
2725    call (RPC) applications and for long-term storage of data.

2727    MessagePack has been essentially stable since it was first published
2728    around 2011; it has not yet had a transition.  The evolution of
2729    MessagePack is impeded by an imperative to maintain complete
2730    backwards compatibility with existing stored data, while only a few
2731    bytecodes are still available for extension.
Repeated requests over
2732    the years from the MessagePack user community to separate out binary
2733    and text strings in the encoding have recently led to an extension
2734    proposal that would leave MessagePack's "raw" data ambiguous between
2735    its usages for binary and text data.  The extension mechanism for
2736    MessagePack remains unclear.

2738 E.3.  BSON

2740    [BSON] is a data format that was developed for the storage of
2741    JSON-like maps (JSON objects) in the MongoDB database.  Its major
2742    distinguishing feature is the capability for in-place update,
2743    forgoing a compact representation.  BSON uses a counted
2744    representation except for map keys, which are null-byte terminated.
2745    While BSON can be used for the representation of JSON-like objects on
2746    the wire, its specification is dominated by the requirements of the
2747    database application and has become somewhat baroque.  How BSON
2748    extensions will be implemented remains unclear.

2750 E.4.  UBJSON

2752    [UBJSON] has a design goal to make JSON faster and somewhat smaller,
2753    using a binary format that is limited to exactly the data model JSON
2754    uses.  Thus, there is expressly no intention to support, for example,
2755    binary data; however, there is a "high-precision number", expressed
2756    as a character string in JSON syntax.  UBJSON is not optimized for
2757    code compactness, and its type byte coding is optimized for human
2758    recognition and not for compact representation of native types such
2759    as small integers.  Although UBJSON is mostly counted, it provides a
2760    reserved "unknown-length" value to support streaming of arrays and
2761    maps (JSON objects).  Within these containers, UBJSON also has a
2762    "Noop" type for padding.

2764 E.5.  MSDTP: RFC 713

2766    Message Services Data Transmission Protocol (MSDTP) is a very early
2767    example of a compact message format; it is described in [RFC0713],
2768    written in 1976.
It is included here for its historical value, not because it
2769    was ever widely used.

2771 E.6.  Conciseness on the Wire

2773    While CBOR's design objective of code compactness for encoders and
2774    decoders is a higher priority than its objective of conciseness on
2775    the wire, many people focus on the wire size.  Table 6 shows some
2776    encoding examples for the simple nested array [1, [2, 3]]; where some
2777    form of indefinite-length encoding is supported by the encoding,
2778    [_ 1, [2, 3]] (indefinite length on the outer array) is also shown.

2780    +-------------+--------------------------+--------------------------+
2781    | Format      | [1, [2, 3]]              | [_ 1, [2, 3]]            |
2782    +-------------+--------------------------+--------------------------+
2783    | RFC 713     | c2 05 81 c2 02 82 83     |                          |
2784    |             |                          |                          |
2785    | ASN.1 BER   | 30 0b 02 01 01 30 06 02  | 30 80 02 01 01 30 06 02  |
2786    |             | 01 02 02 01 03           | 01 02 02 01 03 00 00     |
2787    |             |                          |                          |
2788    | MessagePack | 92 01 92 02 03           |                          |
2789    |             |                          |                          |
2790    | BSON        | 22 00 00 00 10 30 00 01  |                          |
2791    |             | 00 00 00 04 31 00 13 00  |                          |
2792    |             | 00 00 10 30 00 02 00 00  |                          |
2793    |             | 00 10 31 00 03 00 00 00  |                          |
2794    |             | 00 00                    |                          |
2795    |             |                          |                          |
2796    | UBJSON      | 61 02 42 01 61 02 42 02  | 61 ff 42 01 61 02 42 02  |
2797    |             | 42 03                    | 42 03 45                 |
2798    |             |                          |                          |
2799    | CBOR        | 82 01 82 02 03           | 9f 01 82 02 03 ff        |
2800    +-------------+--------------------------+--------------------------+

2802          Table 6: Examples for Different Levels of Conciseness

2804 Appendix F.  Changes from RFC 7049

2806    The following is a list of known changes from RFC 7049.  This list is
2807    non-authoritative.  It is meant to help reviewers see the significant
2808    differences.

2810    o  Updated reference for [RFC4627] to [RFC7159] in many places

2812    o  Updated reference for [CNN-TERMS] to [RFC7228]

2814    o  Added a comment to the last example in Section 2.2.1 (added
2815       "Second value")

2817    o  Fixed a bug in the example in Section 2.4.2 ("29" -> "49")

2819    o  Fixed a bug in the last paragraph of Section 3.6 ("0b000_11101" ->
2820       "0b000_11001")

2822 Authors' Addresses
2823    Carsten Bormann
2824    Universitaet Bremen TZI
2825    Postfach 330440
2826    D-28359 Bremen
2827    Germany

2829    Phone: +49-421-218-63921
2830    EMail: cabo@tzi.org

2832    Paul Hoffman
2833    ICANN

2835    EMail: paul.hoffman@icann.org