idnits 2.17.1

draft-ietf-cbor-7049bis-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 20, 2018) is 2044 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '2' on line 2293 -- Looks like a reference, but probably isn't: '3' on line 2293 -- Looks like a reference, but probably isn't: '4' on line 2291 -- Looks like a reference, but probably isn't: '5' on line 2291 -- Looks like a reference, but probably isn't: '100' on line 1494 == Missing Reference: '-1' is mentioned on line 1490, but not defined -- Looks like a reference, but probably isn't: '1' on line 2570 == Missing Reference: 'RFCthis' is mentioned on line 1927, but not defined == Missing Reference: 'TM' is mentioned on line 2110, but not defined -- Looks like a reference, but probably isn't: '0' on line 2586 == Missing Reference: 'RFC4267' is mentioned on line 2738, but not defined == Missing Reference: 'CNN-TERMS' is mentioned on line 2740, but not defined -- Possible downref: Non-RFC (?) normative reference: ref. 'ECMA262' -- Obsolete informational reference (is this intentional?): RFC 7049 (Obsoleted by RFC 8949) Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 11 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group C. Bormann 3 Internet-Draft Universitaet Bremen TZI 4 Intended status: Standards Track P. Hoffman 5 Expires: March 24, 2019 ICANN 6 September 20, 2018 8 Concise Binary Object Representation (CBOR) 9 draft-ietf-cbor-7049bis-03 11 Abstract 13 The Concise Binary Object Representation (CBOR) is a data format 14 whose design goals include the possibility of extremely small code 15 size, fairly small message size, and extensibility without the need 16 for version negotiation. 
These design goals make it different from 17 earlier binary serializations such as ASN.1 and MessagePack. 19 Contributing 21 This document is being worked on in the CBOR Working Group. Please 22 contribute on the mailing list there, or in the GitHub repository for 23 this draft: https://github.com/cbor-wg/CBORbis 25 The charter for the CBOR Working Group says that the WG will update 26 RFC 7049 to fix verified errata. Security issues and clarifications 27 may be addressed, but changes to this document will ensure backward 28 compatibility for popular deployed codebases. This document will be 29 targeted at becoming an Internet Standard. 31 Status of This Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at https://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on March 24, 2019. 48 Copyright Notice 50 Copyright (c) 2018 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents 55 (https://trustee.ietf.org/license-info) in effect on the date of 56 publication of this document. Please review these documents 57 carefully, as they describe your rights and restrictions with respect 58 to this document. 
Code Components extracted from this document must 59 include Simplified BSD License text as described in Section 4.e of 60 the Trust Legal Provisions and are provided without warranty as 61 described in the Simplified BSD License. 63 Table of Contents 65 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 66 1.1. Objectives . . . . . . . . . . . . . . . . . . . . . . . 4 67 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 68 2. CBOR Data Models . . . . . . . . . . . . . . . . . . . . . . 6 69 2.1. Extended Generic Data Models . . . . . . . . . . . . . . 7 70 2.2. Specific Data Models . . . . . . . . . . . . . . . . . . 8 71 3. Specification of the CBOR Encoding . . . . . . . . . . . . . 8 72 3.1. Major Types . . . . . . . . . . . . . . . . . . . . . . . 9 73 3.2. Indefinite Lengths for Some Major Types . . . . . . . . . 11 74 3.2.1. Indefinite-Length Arrays and Maps . . . . . . . . . . 11 75 3.2.2. Indefinite-Length Byte Strings and Text Strings . . . 13 76 3.3. Floating-Point Numbers and Values with No Content . . . . 14 77 3.4. Optional Tagging of Items . . . . . . . . . . . . . . . . 16 78 3.4.1. Date and Time . . . . . . . . . . . . . . . . . . . . 18 79 3.4.2. Bignums . . . . . . . . . . . . . . . . . . . . . . . 18 80 3.4.3. Decimal Fractions and Bigfloats . . . . . . . . . . . 19 81 3.4.4. Content Hints . . . . . . . . . . . . . . . . . . . . 20 82 3.4.4.1. Encoded CBOR Data Item . . . . . . . . . . . . . 20 83 3.4.4.2. Expected Later Encoding for CBOR-to-JSON 84 Converters . . . . . . . . . . . . . . . . . . . 20 85 3.4.4.3. Encoded Text . . . . . . . . . . . . . . . . . . 21 86 3.4.5. Self-Describe CBOR . . . . . . . . . . . . . . . . . 21 87 4. Creating CBOR-Based Protocols . . . . . . . . . . . . . . . . 22 88 4.1. CBOR in Streaming Applications . . . . . . . . . . . . . 23 89 4.2. Generic Encoders and Decoders . . . . . . . . . . . . . . 23 90 4.3. Syntax Errors . . . . . . . . . . . . . . . . . . . . . . 24 91 4.3.1. 
Incomplete CBOR Data Items . . . . . . . . . . . . . 24 92 4.3.2. Malformed Indefinite-Length Items . . . . . . . . . . 24 93 4.3.3. Unknown Additional Information Values . . . . . . . . 25 94 4.4. Other Decoding Errors . . . . . . . . . . . . . . . . . . 25 95 4.5. Handling Unknown Simple Values and Tags . . . . . . . . . 26 96 4.6. Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 26 97 4.7. Specifying Keys for Maps . . . . . . . . . . . . . . . . 27 98 4.7.1. Equivalence of Keys . . . . . . . . . . . . . . . . . 28 99 4.8. Undefined Values . . . . . . . . . . . . . . . . . . . . 29 100 4.9. Canonical CBOR . . . . . . . . . . . . . . . . . . . . . 29 101 4.9.1. Length-first map key ordering . . . . . . . . . . . . 31 102 4.10. Strict Mode . . . . . . . . . . . . . . . . . . . . . . . 32 103 5. Converting Data between CBOR and JSON . . . . . . . . . . . . 33 104 5.1. Converting from CBOR to JSON . . . . . . . . . . . . . . 33 105 5.2. Converting from JSON to CBOR . . . . . . . . . . . . . . 35 106 6. Future Evolution of CBOR . . . . . . . . . . . . . . . . . . 36 107 6.1. Extension Points . . . . . . . . . . . . . . . . . . . . 36 108 6.2. Curating the Additional Information Space . . . . . . . . 37 109 7. Diagnostic Notation . . . . . . . . . . . . . . . . . . . . . 37 110 7.1. Encoding Indicators . . . . . . . . . . . . . . . . . . . 38 111 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 112 8.1. Simple Values Registry . . . . . . . . . . . . . . . . . 39 113 8.2. Tags Registry . . . . . . . . . . . . . . . . . . . . . . 39 114 8.3. Media Type ("MIME Type") . . . . . . . . . . . . . . . . 40 115 8.4. CoAP Content-Format . . . . . . . . . . . . . . . . . . . 41 116 8.5. The +cbor Structured Syntax Suffix Registration . . . . . 41 117 9. Security Considerations . . . . . . . . . . . . . . . . . . . 42 118 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 42 119 11. References . . . . . . . . . . . . . . . . . . . . . . . 
. . 43 120 11.1. Normative References . . . . . . . . . . . . . . . . . . 43 121 11.2. Informative References . . . . . . . . . . . . . . . . . 44 122 Appendix A. Examples . . . . . . . . . . . . . . . . . . . . . . 46 123 Appendix B. Jump Table . . . . . . . . . . . . . . . . . . . . . 50 124 Appendix C. Pseudocode . . . . . . . . . . . . . . . . . . . . . 53 125 Appendix D. Half-Precision . . . . . . . . . . . . . . . . . . . 55 126 Appendix E. Comparison of Other Binary Formats to CBOR's Design 127 Objectives . . . . . . . . . . . . . . . . . . . . . 56 128 E.1. ASN.1 DER, BER, and PER . . . . . . . . . . . . . . . . . 57 129 E.2. MessagePack . . . . . . . . . . . . . . . . . . . . . . . 57 130 E.3. BSON . . . . . . . . . . . . . . . . . . . . . . . . . . 58 131 E.4. UBJSON . . . . . . . . . . . . . . . . . . . . . . . . . 58 132 E.5. MSDTP: RFC 713 . . . . . . . . . . . . . . . . . . . . . 58 133 E.6. Conciseness on the Wire . . . . . . . . . . . . . . . . . 58 134 Appendix F. Changes from RFC 7049 . . . . . . . . . . . . . . . 59 135 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 59 137 1. Introduction 139 There are hundreds of standardized formats for binary representation 140 of structured data (also known as binary serialization formats). Of 141 those, some are for specific domains of information, while others are 142 generalized for arbitrary data. In the IETF, probably the best-known 143 formats in the latter category are ASN.1's BER and DER [ASN.1]. 145 The format defined here follows some specific design goals that are 146 not well met by current formats. The underlying data model is an 147 extended version of the JSON data model [RFC8259]. It is important 148 to note that this is not a proposal that the grammar in RFC 8259 be 149 extended in general, since doing so would cause a significant 150 backwards incompatibility with already deployed JSON documents. 
151 Instead, this document simply defines its own data model that starts 152 from JSON. 154 Appendix E lists some existing binary formats and discusses how well 155 they do or do not fit the design objectives of the Concise Binary 156 Object Representation (CBOR). 158 1.1. Objectives 160 The objectives of CBOR, roughly in decreasing order of importance, 161 are: 163 1. The representation must be able to unambiguously encode most 164 common data formats used in Internet standards. 166 * It must represent a reasonable set of basic data types and 167 structures using binary encoding. "Reasonable" here is 168 largely influenced by the capabilities of JSON, with the major 169 addition of binary byte strings. The structures supported are 170 limited to arrays and trees; loops and lattice-style graphs 171 are not supported. 173 * There is no requirement that all data formats be uniquely 174 encoded; that is, it is acceptable that the number "7" might 175 be encoded in multiple different ways. 177 2. The code for an encoder or decoder must be able to be compact in 178 order to support systems with very limited memory, processor 179 power, and instruction sets. 181 * An encoder and a decoder need to be implementable in a very 182 small amount of code (for example, in class 1 constrained 183 nodes as defined in [RFC7228]). 185 * The format should use contemporary machine representations of 186 data (for example, not requiring binary-to-decimal 187 conversion). 189 3. Data must be able to be decoded without a schema description. 191 * Similar to JSON, encoded data should be self-describing so 192 that a generic decoder can be written. 194 4. The serialization must be reasonably compact, but data 195 compactness is secondary to code compactness for the encoder and 196 decoder. 198 * "Reasonable" here is bounded by JSON as an upper bound in 199 size, and by implementation complexity maintaining a lower 200 bound. 
Using either general compression schemes or extensive 201 bit-fiddling violates the complexity goals. 203 5. The format must be applicable to both constrained nodes and high- 204 volume applications. 206 * This means it must be reasonably frugal in CPU usage for both 207 encoding and decoding. This is relevant both for constrained 208 nodes and for potential usage in applications with a very high 209 volume of data. 211 6. The format must support all JSON data types for conversion to and 212 from JSON. 214 * It must support a reasonable level of conversion as long as 215 the data represented is within the capabilities of JSON. It 216 must be possible to define a unidirectional mapping towards 217 JSON for all types of data. 219 7. The format must be extensible, and the extended data must be 220 decodable by earlier decoders. 222 * The format is designed for decades of use. 224 * The format must support a form of extensibility that allows 225 fallback so that a decoder that does not understand an 226 extension can still decode the message. 228 * The format must be able to be extended in the future by later 229 IETF standards. 231 1.2. Terminology 233 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 234 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 235 document are to be interpreted as described in RFC 2119, BCP 14 236 [RFC2119] and indicate requirement levels for compliant CBOR 237 implementations. 239 The term "byte" is used in its now-customary sense as a synonym for 240 "octet". All multi-byte values are encoded in network byte order 241 (that is, most significant byte first, also known as "big-endian"). 243 This specification makes use of the following terminology: 245 Data item: A single piece of CBOR data. The structure of a data 246 item may contain zero, one, or more nested data items. 
The term 247 is used both for the data item in representation format and for 248 the abstract idea that can be derived from that by a decoder. 250 Decoder: A process that decodes a CBOR data item and makes it 251 available to an application. Formally speaking, a decoder 252 contains a parser to break up the input using the syntax rules of 253 CBOR, as well as a semantic processor to prepare the data in a 254 form suitable to the application. 256 Encoder: A process that generates the representation format of a 257 CBOR data item from application information. 259 Data Stream: A sequence of zero or more data items, not further 260 assembled into a larger containing data item. The independent 261 data items that make up a data stream are sometimes also referred 262 to as "top-level data items". 264 Well-formed: A data item that follows the syntactic structure of 265 CBOR. A well-formed data item uses the initial bytes and the byte 266 strings and/or data items that are implied by their values as 267 defined in CBOR and is not followed by extraneous data. 269 Valid: A data item that is well-formed and also follows the semantic 270 restrictions that apply to CBOR data items. 272 Stream decoder: A process that decodes a data stream and makes each 273 of the data items in the sequence available to an application as 274 they are received. 276 Where bit arithmetic or data types are explained, this document uses 277 the notation familiar from the programming language C, except that 278 "**" denotes exponentiation. Similar to the "0x" notation for 279 hexadecimal numbers, numbers in binary notation are prefixed with 280 "0b". Underscores can be added to such a number solely for 281 readability, so 0b00100001 (0x21) might be written 0b001_00001 to 282 emphasize the desired interpretation of the bits in the byte; in this 283 case, it is split into three bits and five bits. 285 2. 
CBOR Data Models 287 CBOR is explicit about its generic data model, which defines the set 288 of all data items that can be represented in CBOR. Its basic generic 289 data model is extensible by the registration of simple type values 290 and tags. Applications can then subset the resulting extended 291 generic data model to build their specific data models. 293 Within environments that can represent the data items in the generic 294 data model, generic CBOR encoders and decoders can be implemented 295 (which usually involves defining additional implementation data types 296 for those data items that do not already have a natural 297 representation in the environment). The ability to provide generic 298 encoders and decoders is an explicit design goal of CBOR; however 299 many applications will provide their own application-specific 300 encoders and/or decoders. 302 In the basic (un-extended) generic data model, a data item is one of: 304 o an integer in the range -2**64..2**64-1 inclusive 306 o a simple value, identified by a number between 0 and 255, but 307 distinct from that number 309 o a floating point value, distinct from an integer, out of the set 310 representable by IEEE 754 binary64 (including non-finites) 312 o a sequence of zero or more bytes ("byte string") 314 o a sequence of zero or more Unicode code points ("text string") 316 o a sequence of zero or more data items ("array") 318 o a mapping (mathematical function) from zero or more data items 319 ("keys") each to a data item ("values"), ("map") 321 o a tagged data item, comprising a tag (an integer in the range 322 0..2**64-1) and a value (a data item) 324 Note that integer and floating-point values are distinct in this 325 model, even if they have the same numeric value. 327 2.1. 
Extended Generic Data Models 329 This basic generic data model comes pre-extended by the registration 330 of a number of simple values and tags right in this document, such 331 as: 333 o "false", "true", "null", and "undefined" (simple values identified 334 by 20..23) 336 o integer and floating point values with a larger range and 337 precision than the above (tags 2 to 5) 339 o application data types such as a point in time or an RFC 3339 340 date/time string (tags 1, 0) 342 Further elements of the extended generic data model can be (and have 343 been) defined via the IANA registries created for CBOR. Even if such 344 an extension is unknown to a generic encoder or decoder, data items 345 using that extension can be passed to or from the application by 346 representing them at the interface to the application within the 347 basic generic data model, i.e., as generic values of a simple type or 348 generic tagged items. 350 In other words, the basic generic data model is stable as defined in 351 this document, while the extended generic data model expands by the 352 registration of new simple values or tags, but never shrinks. 354 While there is a strong expectation that generic encoders and 355 decoders can represent "false", "true", and "null" ("undefined" is 356 intentionally omitted) in the form appropriate for their programming 357 environment, implementation of the data model extensions created by 358 tags is truly optional and a matter of implementation quality. 360 2.2. Specific Data Models 362 The specific data model for a CBOR-based protocol usually subsets the 363 extended generic data model and assigns application semantics to the 364 data items within this subset and its components. 
When documenting
365    such specific data models, where it is desired to specify the types
366    of data items, it is preferred to identify the types by their names
367    in the generic data model ("negative integer", "array") instead of by
368    referring to aspects of their CBOR representation ("major type 1",
369    "major type 4").

371    Specific data models can also specify that values of different types
372    are equivalent for the purposes of map keys and encoder freedom. For
373    example, in the generic data model, a valid map MAY have both "0" and
374    "0.0" as keys, and an encoder MUST NOT encode "0.0" as an integer
375    (major type 0, Section 3.1). However, if a specific data model
376    declares that floating point and integer representations of integral
377    values are equivalent, map keys "0" and "0.0" would be considered
378    duplicates and so invalid, and an encoder could encode integral-
379    valued floats as integers or vice versa, perhaps to save encoded
380    bytes.

382 3. Specification of the CBOR Encoding

384    A CBOR data item (Section 2) is encoded to or decoded from a byte
385    string as described in this section. The encoding is summarized in
386    Table 5.

388    The initial byte of each encoded data item contains both information
389    about the major type (the high-order 3 bits, described in
390    Section 3.1) and additional information (the low-order 5 bits).
391    Additional information value 31 is used for indefinite-length items,
392    described in Section 3.2. Additional information values 28 to 30 are
393    reserved for future expansion.

395    Additional information values from 0 to 27 describe how to construct
396    an "argument", possibly consuming additional bytes. For major type 7
397    and additional information 25 to 27 (floating point numbers), there
398    is a special case; in all other cases, the argument constructed from
399    the additional information value, possibly combined with following
400    bytes, is an unsigned integer.
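As a rough illustration of the initial-byte layout just described, together with the argument sizes given in the next paragraph, the following Python sketch splits an initial byte and builds the argument. The function name "decode_head" and its return shape are hypothetical, chosen only for this example; they are not part of the specification.

```python
def decode_head(data: bytes, offset: int = 0):
    """Return (major_type, additional_info, argument, head_length).

    Hypothetical helper for illustration only. For major type 7 with
    additional information 25..27 the returned argument holds the raw
    bits of a floating-point number, which the caller must reinterpret.
    """
    if offset >= len(data):
        raise ValueError("not well-formed: input ends before the initial byte")

    initial = data[offset]
    major_type = initial >> 5             # high-order 3 bits
    additional = initial & 0b000_11111    # low-order 5 bits

    if additional < 24:
        # The additional information is directly the argument's value.
        return major_type, additional, additional, 1

    if additional <= 27:
        # 24, 25, 26, 27 -> 1, 2, 4, 8 following bytes, network byte order.
        size = 1 << (additional - 24)
        raw = data[offset + 1 : offset + 1 + size]
        if len(raw) < size:
            raise ValueError("not well-formed: input ends inside the argument")
        return major_type, additional, int.from_bytes(raw, "big"), 1 + size

    if additional == 31:
        # Indefinite-length marker (or "break" in major type 7): no argument.
        return major_type, additional, None, 1

    raise ValueError("additional information 28..30 is reserved")
```

For instance, the three bytes 0x1901f4 yield major type 0 with argument 500, and 0x3901f3 yields major type 1 with argument 499, that is, the integer -1 - 499 = -500.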
402 When the value of the additional information is less than 24, it is 403 directly used as the argument's value. When it is 24 to 27, the 404 argument's value is held in the following 1, 2, 4, or 8, 405 respectively, bytes, in network byte order. 407 The meaning of this argument depends on the major type. For example, 408 in major type 0, the argument is the value of the data item itself 409 (and in major type 1 the value of the data item is computed from the 410 argument); in major type 2 and 3 it gives the length of the string 411 data in bytes that follows; and in major types 4 and 5 it is used to 412 determine the number of data items enclosed. 414 If the encoded sequence of bytes ends before the end of a data item 415 would be reached, that encoding is not well-formed. If the encoded 416 sequence of bytes still has bytes remaining after the outermost 417 encoded item is parsed, that encoding is not a single well-formed 418 CBOR item. 420 A CBOR decoder implementation can be based on a jump table with all 421 256 defined values for the initial byte (Table 5). A decoder in a 422 constrained implementation can instead use the structure of the 423 initial byte and following bytes for more compact code (see 424 Appendix C for a rough impression of how this could look). 426 3.1. Major Types 428 The following lists the major types and the additional information 429 and other bytes associated with the type. 431 Major type 0: an integer in the range 0..2**64-1 inclusive. The 432 value of the encoded item is the argument itself. For example, 433 the integer 10 is denoted as the one byte 0b000_01010 (major type 434 0, additional information 10). The integer 500 would be 435 0b000_11001 (major type 0, additional information 25) followed by 436 the two bytes 0x01f4, which is 500 in decimal. 438 Major type 1: a negative integer in the range -2**64..-1 inclusive. 439 The value of the item is -1 minus the argument. 
For example, the 440 integer -500 would be 0b001_11001 (major type 1, additional 441 information 25) followed by the two bytes 0x01f3, which is 499 in 442 decimal. 444 Major type 2: a byte string. The number of bytes in the string is 445 equal to the argument. For example, a byte string whose length is 446 5 would have an initial byte of 0b010_00101 (major type 2, 447 additional information 5 for the length), followed by 5 bytes of 448 binary content. A byte string whose length is 500 would have 3 449 initial bytes of 0b010_11001 (major type 2, additional information 450 25 to indicate a two-byte length) followed by the two bytes 0x01f4 451 for a length of 500, followed by 500 bytes of binary content. 453 Major type 3: a text string (Section 2), encoded as UTF-8 454 ([RFC3629]). The number of bytes in the string is equal to the 455 argument. A string containing an invalid UTF-8 sequence is well- 456 formed but invalid. This type is provided for systems that need 457 to interpret or display human-readable text, and allows the 458 differentiation between unstructured bytes and text that has a 459 specified repertoire and encoding. In contrast to formats such as 460 JSON, the Unicode characters in this type are never escaped. 461 Thus, a newline character (U+000A) is always represented in a 462 string as the byte 0x0a, and never as the bytes 0x5c6e (the 463 characters "\" and "n") or as 0x5c7530303061 (the characters "\", 464 "u", "0", "0", "0", and "a"). 466 Major type 4: an array of data items. Arrays are also called lists, 467 sequences, or tuples. The argument is the number of data items in 468 the array. Items in an array do not need to all be of the same 469 type. For example, an array that contains 10 items of any type 470 would have an initial byte of 0b100_01010 (major type of 4, 471 additional information of 10 for the length) followed by the 10 472 remaining items. 474 Major type 5: a map of pairs of data items. 
Maps are also called 475 tables, dictionaries, hashes, or objects (in JSON). A map is 476 comprised of pairs of data items, each pair consisting of a key 477 that is immediately followed by a value. The argument is the 478 number of _pairs_ of data items in the map. For example, a map 479 that contains 9 pairs would have an initial byte of 0b101_01001 480 (major type of 5, additional information of 9 for the number of 481 pairs) followed by the 18 remaining items. The first item is the 482 first key, the second item is the first value, the third item is 483 the second key, and so on. A map that has duplicate keys may be 484 well-formed, but it is not valid, and thus it causes indeterminate 485 decoding; see also Section 4.7. 487 Major type 6: a tagged data item whose tag is the argument and whose 488 value is the single following encoded item. See Section 3.4. 490 Major type 7: floating-point numbers and simple values, as well as 491 the "break" stop code. See Section 3.3. 493 These eight major types lead to a simple table showing which of the 494 256 possible values for the initial byte of a data item are used 495 (Table 5). 497 In major types 6 and 7, many of the possible values are reserved for 498 future specification. See Section 8 for more information on these 499 values. 501 3.2. Indefinite Lengths for Some Major Types 503 Four CBOR items (arrays, maps, byte strings, and text strings) can be 504 encoded with an indefinite length using additional information value 505 31. This is useful if the encoding of the item needs to begin before 506 the number of items inside the array or map, or the total length of 507 the string, is known. (The application of this is often referred to 508 as "streaming" within a data item.) 510 Indefinite-length arrays and maps are dealt with differently than 511 indefinite-length byte strings and text strings. 513 3.2.1. 
Indefinite-Length Arrays and Maps 515 Indefinite-length arrays and maps are simply opened without 516 indicating the number of data items that will be included in the 517 array or map, using the additional information value of 31. The 518 initial major type and additional information byte is followed by the 519 elements of the array or map, just as they would be in other arrays 520 or maps. The end of the array or map is indicated by encoding a 521 "break" stop code in a place where the next data item would normally 522 have been included. The "break" is encoded with major type 7 and 523 additional information value 31 (0b111_11111) but is not itself a 524 data item: it is just a syntactic feature to close the array or map. 525 That is, the "break" stop code comes after the last item in the array 526 or map, and it cannot occur anywhere else in place of a data item. 527 In this way, indefinite-length arrays and maps look identical to 528 other arrays and maps except for beginning with the additional 529 information value 31 and ending with the "break" stop code. 531 Arrays and maps with indefinite lengths allow any number of items 532 (for arrays) and key/value pairs (for maps) to be given before the 533 "break" stop code. There is no restriction against nesting 534 indefinite-length array or map items. A "break" only terminates a 535 single item, so nested indefinite-length items need exactly as many 536 "break" stop codes as there are type bytes starting an indefinite- 537 length item. 539 For example, assume an encoder wants to represent the abstract array 540 [1, [2, 3], [4, 5]]. 
The definite-length encoding would be
541    0x8301820203820405:

543    83        -- Array of length 3
544       01     -- 1
545       82     -- Array of length 2
546          02  -- 2
547          03  -- 3
548       82     -- Array of length 2
549          04  -- 4
550          05  -- 5

552    Indefinite-length encoding could be applied independently to each of
553    the three arrays encoded in this data item, as required, leading to
554    representations such as:

556    0x9f018202039f0405ffff
557    9F        -- Start indefinite-length array
558       01     -- 1
559       82     -- Array of length 2
560          02  -- 2
561          03  -- 3
562       9F     -- Start indefinite-length array
563          04  -- 4
564          05  -- 5
565          FF  -- "break" (inner array)
566       FF     -- "break" (outer array)

568    0x9f01820203820405ff
569    9F        -- Start indefinite-length array
570       01     -- 1
571       82     -- Array of length 2
572          02  -- 2
573          03  -- 3
574       82     -- Array of length 2
575          04  -- 4
576          05  -- 5
577       FF     -- "break"

579    0x83018202039f0405ff
580    83        -- Array of length 3
581       01     -- 1
582       82     -- Array of length 2
583          02  -- 2
584          03  -- 3
585       9F     -- Start indefinite-length array
586          04  -- 4
587          05  -- 5
588          FF  -- "break"

590    0x83019f0203ff820405
591    83        -- Array of length 3
592       01     -- 1
593       9F     -- Start indefinite-length array
594          02  -- 2
595          03  -- 3
596          FF  -- "break"
597       82     -- Array of length 2
598          04  -- 4
599          05  -- 5

601    An example of an indefinite-length map (that happens to have two key/
602    value pairs) might be:

604    0xbf6346756ef563416d7421ff
605    BF           -- Start indefinite-length map
606       63        -- First key, UTF-8 string length 3
607          46756e -- "Fun"
608       F5        -- First value, true
609       63        -- Second key, UTF-8 string length 3
610          416d74 -- "Amt"
611       21        -- Second value, -2
612       FF        -- "break"

614 3.2.2. Indefinite-Length Byte Strings and Text Strings

616    Indefinite-length byte strings and text strings are actually a
617    concatenation of zero or more definite-length byte or text strings
618    ("chunks") that are together treated as one contiguous string.
619    Indefinite-length strings are opened with the major type and
620    additional information value of 31, but what follows is a series of
621    byte or text strings that have definite lengths (the chunks). The
622    end of the series of chunks is indicated by encoding the "break" stop
623    code (0b111_11111) in a place where the next chunk in the series
624    would occur. The contents of the chunks are concatenated together,
625    and the overall length of the indefinite-length string will be the
626    sum of the lengths of all of the chunks. In summary, an indefinite-
627    length string is encoded similarly to how an indefinite-length array
628    of its chunks would be encoded, except that the major type of the
629    indefinite-length string is that of a (text or byte) string and
630    matches the major types of its chunks.

632    For indefinite-length byte strings, every data item (chunk) between
633    the indefinite-length indicator and the "break" MUST be a definite-
634    length byte string item; if the parser sees any item type other than
635    a byte string before it sees the "break", it is an error.

637    For example, assume the sequence:

639    0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111

641    5F              -- Start indefinite-length byte string
642       44           -- Byte string of length 4
643          aabbccdd  -- Bytes content
644       43           -- Byte string of length 3
645          eeff99    -- Bytes content
646       FF           -- "break"

648    After decoding, this results in a single byte string with seven
649    bytes: 0xaabbccddeeff99.

651    Text strings with indefinite lengths act the same as byte strings
652    with indefinite lengths, except that all their chunks MUST be
653    definite-length text strings. Note that this implies that the bytes
654    of a single UTF-8 character cannot be spread between chunks: a new
655    chunk can only be started at a character boundary.

657 3.3.
Floating-Point Numbers and Values with No Content 659 Major type 7 is for two types of data: floating-point numbers and 660 "simple values" that do not need any content. Each value of the 661 5-bit additional information in the initial byte has its own separate 662 meaning, as defined in Table 1. Like the major types for integers, 663 items of this major type do not carry content data; all the 664 information is in the initial bytes. 666 +-------------+--------------------------------------------------+ 667 | 5-Bit Value | Semantics | 668 +-------------+--------------------------------------------------+ 669 | 0..23 | Simple value (value 0..23) | 670 | | | 671 | 24 | Simple value (value 32..255 in following byte) | 672 | | | 673 | 25 | IEEE 754 Half-Precision Float (16 bits follow) | 674 | | | 675 | 26 | IEEE 754 Single-Precision Float (32 bits follow) | 676 | | | 677 | 27 | IEEE 754 Double-Precision Float (64 bits follow) | 678 | | | 679 | 28-30 | (Unassigned) | 680 | | | 681 | 31 | "break" stop code for indefinite-length items | 682 +-------------+--------------------------------------------------+ 684 Table 1: Values for Additional Information in Major Type 7 686 As with all other major types, the 5-bit value 24 signifies a single- 687 byte extension: it is followed by an additional byte to represent the 688 simple value. (To minimize confusion, only the values 32 to 255 are 689 used.) This maintains the structure of the initial bytes: as for the 690 other major types, the length of these always depends on the 691 additional information in the first byte. Table 2 lists the values 692 assigned and available for simple types. 
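The dispatch on the 5-bit additional information in Table 1 can be sketched as follows. This is a minimal, non-normative illustration in Python. The function name decode_major_type_7 and the shape of its return value are hypothetical and are not defined by this specification. Rejecting extension-byte values below 32 reflects the note above that only the values 32 to 255 are used.

```python
import struct

# struct format strings for additional information 25, 26, and 27
# (big-endian half, single, and double precision).
FLOAT_FORMATS = {25: ">e", 26: ">f", 27: ">d"}


def decode_major_type_7(data):
    """Decode one major type 7 item at the start of `data`.

    Returns (kind, value, bytes_consumed). Hypothetical helper for
    illustration only; not an API defined by CBOR.
    """
    if not data:
        raise ValueError("no data")
    initial = data[0]
    if initial >> 5 != 7:
        raise ValueError("not major type 7")
    info = initial & 0x1F
    if info <= 23:
        # Simple value carried directly in the initial byte.
        return ("simple", info, 1)
    if info == 24:
        # Simple value carried in one following byte (32..255 only).
        if len(data) < 2:
            raise ValueError("truncated item")
        if data[1] < 32:
            raise ValueError("simple value below 32 in extension byte")
        return ("simple", data[1], 2)
    if info in FLOAT_FORMATS:
        fmt = FLOAT_FORMATS[info]
        size = struct.calcsize(fmt)
        if len(data) < 1 + size:
            raise ValueError("truncated item")
        return ("float", struct.unpack(fmt, data[1:1 + size])[0], 1 + size)
    if info == 31:
        return ("break", None, 1)
    # 28..30 are unassigned.
    raise ValueError("unassigned additional information")
```

For example, the single byte 0xf5 decodes to simple value 21, and 0xf93c00 decodes to the half-precision float 1.0.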
694 +---------+-----------------+ 695 | Value | Semantics | 696 +---------+-----------------+ 697 | 0..19 | (Unassigned) | 698 | | | 699 | 20 | False | 700 | | | 701 | 21 | True | 702 | | | 703 | 22 | Null | 704 | | | 705 | 23 | Undefined value | 706 | | | 707 | 24..31 | (Reserved) | 708 | | | 709 | 32..255 | (Unassigned) | 710 +---------+-----------------+ 712 Table 2: Simple Values 714 The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit 715 IEEE 754 binary floating-point values. These floating-point values 716 are encoded in the additional bytes of the appropriate size. (See 717 Appendix D for some information about 16-bit floating point.) 719 An encoder MUST NOT encode False as the two-byte sequence of 0xf814, 720 MUST NOT encode True as the two-byte sequence of 0xf815, MUST NOT 721 encode Null as the two-byte sequence of 0xf816, and MUST NOT encode 722 Undefined value as the two-byte sequence of 0xf817. A decoder MUST 723 treat these two-byte sequences as an error. Similar prohibitions 724 apply to the unassigned simple values as well. 726 3.4. Optional Tagging of Items 728 In CBOR, a data item can optionally be preceded by a tag to give it 729 additional semantics while retaining its structure. The tag is major 730 type 6, and represents an integer number as indicated by the tag's 731 argument (Section 3); the (sole) data item is carried as content 732 data. If a tag requires structured data, this structure is encoded 733 into the nested data item. The definition of a tag usually restricts 734 what kinds of nested data item or items are valid. 736 The initial bytes of the tag follow the rules for positive integers 737 (major type 0). The tag is followed by a single data item of any 738 type. For example, assume that a byte string of length 12 is marked 739 with a tag to indicate it is a positive bignum (Section 3.4.2). 
This 740 would be marked as 0b110_00010 (major type 6, additional information 741 2 for the tag) followed by 0b010_01100 (major type 2, additional 742 information of 12 for the length) followed by the 12 bytes of the 743 bignum. 745 Decoders do not need to understand tags, and thus tags may be of 746 little value in applications where the implementation creating a 747 particular CBOR data item and the implementation decoding that stream 748 know the semantic meaning of each item in the data flow. Their 749 primary purpose in this specification is to define common data types 750 such as dates. A secondary purpose is to allow optional tagging when 751 the decoder is a generic CBOR decoder that might be able to benefit 752 from hints about the content of items. Understanding the semantic 753 tags is optional for a decoder; it can just jump over the initial 754 bytes of the tag and interpret the tagged data item itself. 756 A tag always applies to the item that directly follows it. 757 Thus, if tag A is followed by tag B, which is followed by data item 758 C, tag A applies to the result of applying tag B to data item C. 759 That is, a tagged item is a data item consisting of a tag and a 760 value. The content of the tagged item is the data item (the value) 761 that is being tagged. 763 IANA maintains a registry of tag values as described in Section 8.2. 764 Table 3 provides a list of initial values, with definitions in the 765 rest of this section.
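The way a decoder can step over the initial bytes of one or more nested tags to reach the tagged data item can be sketched as follows. This is a minimal, non-normative illustration in Python. The names read_head and collect_tags are hypothetical, and only heads with additional information 0 to 27 are handled.

```python
def read_head(data, pos=0):
    """Return (major_type, argument, next_pos) for the item head at `pos`.

    Hypothetical helper; handles additional information 0..27 only.
    """
    if pos >= len(data):
        raise ValueError("truncated item")
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    if info < 24:
        # The argument is carried in the initial byte itself.
        return major, info, pos + 1
    if info > 27:
        raise ValueError("indefinite or unassigned additional information")
    size = 1 << (info - 24)  # 1, 2, 4, or 8 following bytes
    end = pos + 1 + size
    if end > len(data):
        raise ValueError("truncated head")
    return major, int.from_bytes(data[pos + 1:end], "big"), end


def collect_tags(data, pos=0):
    """Return (tags, content_pos): the tag numbers, outermost first,
    and the offset of the innermost tagged data item."""
    tags = []
    while True:
        major, argument, next_pos = read_head(data, pos)
        if major != 6:
            return tags, pos
        tags.append(argument)
        pos = next_pos
```

Applied to the byte 0xc2 followed by 0x4c and 12 content bytes, collect_tags reports the single tag 2 and the offset of the byte string head. A decoder that does not understand the tag could continue decoding from that offset.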
767 +-----------+--------------+----------------------------------------+ 768 | Tag | Data Item | Semantics | 769 +-----------+--------------+----------------------------------------+ 770 | 0 | UTF-8 string | Standard date/time string; see | 771 | | | Section 3.4.1 | 772 | | | | 773 | 1 | multiple | Epoch-based date/time; see | 774 | | | Section 3.4.1 | 775 | | | | 776 | 2 | byte string | Positive bignum; see Section 3.4.2 | 777 | | | | 778 | 3 | byte string | Negative bignum; see Section 3.4.2 | 779 | | | | 780 | 4 | array | Decimal fraction; see Section 3.4.3 | 781 | | | | 782 | 5 | array | Bigfloat; see Section 3.4.3 | 783 | | | | 784 | 6..20 | (Unassigned) | (Unassigned) | 785 | | | | 786 | 21 | multiple | Expected conversion to base64url | 787 | | | encoding; see Section 3.4.4.2 | 788 | | | | 789 | 22 | multiple | Expected conversion to base64 | 790 | | | encoding; see Section 3.4.4.2 | 791 | | | | 792 | 23 | multiple | Expected conversion to base16 | 793 | | | encoding; see Section 3.4.4.2 | 794 | | | | 795 | 24 | byte string | Encoded CBOR data item; see | 796 | | | Section 3.4.4.1 | 797 | | | | 798 | 25..31 | (Unassigned) | (Unassigned) | 799 | | | | 800 | 32 | UTF-8 string | URI; see Section 3.4.4.3 | 801 | | | | 802 | 33 | UTF-8 string | base64url; see Section 3.4.4.3 | 803 | | | | 804 | 34 | UTF-8 string | base64; see Section 3.4.4.3 | 805 | | | | 806 | 35 | UTF-8 string | Regular expression; see | 807 | | | Section 3.4.4.3 | 808 | | | | 809 | 36 | UTF-8 string | MIME message; see Section 3.4.4.3 | 810 | | | | 811 | 37..55798 | (Unassigned) | (Unassigned) | 812 | | | | 813 | 55799 | multiple | Self-describe CBOR; see Section 3.4.5 | 814 | | | | 815 | 55800+ | (Unassigned) | (Unassigned) | 816 +-----------+--------------+----------------------------------------+ 818 Table 3: Values for Tags 820 3.4.1. Date and Time 822 Protocols using tag values 0 and 1 extend the generic data model 823 (Section 2) with data items representing points in time. 
825 Tag value 0 is for date/time strings that follow the standard format 826 described in [RFC3339], as refined by Section 3.3 of [RFC4287]. 828 Tag value 1 is for numerical representation of seconds relative to 829 1970-01-01T00:00Z in UTC. (For the non-negative values that the 830 Portable Operating System Interface (POSIX) defines, the number of 831 seconds is counted in the same way as for POSIX "seconds since the 832 epoch" [TIME_T].) The tagged item can be a positive or negative 833 integer (major types 0 and 1), or a floating-point number (major type 834 7 with additional information 25, 26, or 27). Note that the number 835 can be negative (time before 1970-01-01T00:00Z) and, if a floating- 836 point number, indicate fractional seconds. 838 3.4.2. Bignums 840 Protocols using tag values 2 and 3 extend the generic data model 841 (Section 2) with "bignums" representing arbitrary integers. In the 842 generic data model, bignum values are not equal to integers from the 843 basic data model, but specific data models can define that 844 equivalence. 846 Bignums are encoded as a byte string data item, which is interpreted 847 as an unsigned integer n in network byte order. For tag value 2, the 848 value of the bignum is n. For tag value 3, the value of the bignum 849 is -1 - n. Decoders that understand these tags MUST be able to 850 decode bignums that have leading zeroes. 852 For example, the number 18446744073709551616 (2**64) is represented 853 as 0b110_00010 (major type 6, tag 2), followed by 0b010_01001 (major 854 type 2, length 9), followed by 0x010000000000000000 (one byte 0x01 855 and eight bytes 0x00). In hexadecimal: 857 C2 -- Tag 2 858 49 -- Byte string of length 9 859 010000000000000000 -- Bytes content 861 3.4.3. Decimal Fractions and Bigfloats 863 Protocols using tag value 4 extend the generic data model with data 864 items representing arbitrary-length decimal fractions m*(10**e).
865 Protocols using tag value 5 extend the generic data model with data 866 items representing arbitrary-length binary fractions m*(2**e). As 867 with bignums, values of different types are not equal in the generic 868 data model. 870 Decimal fractions combine an integer mantissa with a base-10 scaling 871 factor. They are most useful if an application needs the exact 872 representation of a decimal fraction such as 1.1 because there is no 873 exact representation for many decimal fractions in binary floating 874 point. 876 Bigfloats combine an integer mantissa with a base-2 scaling factor. 877 They are binary floating-point values that can exceed the range or 878 the precision of the three IEEE 754 formats supported by CBOR 879 (Section 3.3). Bigfloats may also be used by constrained 880 applications that need some basic binary floating-point capability 881 without the need for supporting IEEE 754. 883 A decimal fraction or a bigfloat is represented as a tagged array 884 that contains exactly two integer numbers: an exponent e and a 885 mantissa m. Decimal fractions (tag 4) use base-10 exponents; the 886 value of a decimal fraction data item is m*(10**e). Bigfloats (tag 887 5) use base-2 exponents; the value of a bigfloat data item is 888 m*(2**e). The exponent e MUST be represented as an integer of major 889 type 0 or 1, while the mantissa can also be a bignum (Section 3.4.2).
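The value rules for bignums (n for tag 2, -1 - n for tag 3) and for decimal fractions and bigfloats (m*(10**e) and m*(2**e)) can be sketched as follows. This is a minimal, non-normative illustration in Python that assumes the tag number, byte string content, exponent, and mantissa have already been decoded. The function names bignum_value and fraction_value are hypothetical. Exact rational arithmetic is used so that no precision is lost.

```python
from fractions import Fraction


def bignum_value(tag, content):
    """Value of a bignum: n for tag 2, -1 - n for tag 3.

    `content` is the byte string, read as an unsigned integer in
    network byte order; leading zero bytes do not change the value.
    """
    n = int.from_bytes(content, "big")
    if tag == 2:
        return n
    if tag == 3:
        return -1 - n
    raise ValueError("not a bignum tag")


def fraction_value(tag, exponent, mantissa):
    """Exact value of a decimal fraction (tag 4) or bigfloat (tag 5)."""
    if tag == 4:
        base = 10
    elif tag == 5:
        base = 2
    else:
        raise ValueError("not a decimal fraction or bigfloat tag")
    # Fraction keeps the result exact even for negative exponents.
    return mantissa * Fraction(base) ** exponent
```

With these helpers, the exponent -2 and mantissa 27315 under tag 4 give exactly 27315/100, and the exponent -1 and mantissa 3 under tag 5 give exactly 3/2.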
891 An example of a decimal fraction is that the number 273.15 could be 892 represented as 0b110_00100 (major type of 6 for the tag, additional 893 information of 4 for the type of tag), followed by 0b100_00010 (major 894 type of 4 for the array, additional information of 2 for the length 895 of the array), followed by 0b001_00001 (major type of 1 for the first 896 integer, additional information of 1 for the value of -2), followed 897 by 0b000_11001 (major type of 0 for the second integer, additional 898 information of 25 for a two-byte value), followed by 899 0b0110101010110011 (27315 in two bytes). In hexadecimal: 901 C4 -- Tag 4 902 82 -- Array of length 2 903 21 -- -2 904 19 6ab3 -- 27315 906 An example of a bigfloat is that the number 1.5 could be represented 907 as 0b110_00101 (major type of 6 for the tag, additional information 908 of 5 for the type of tag), followed by 0b100_00010 (major type of 4 909 for the array, additional information of 2 for the length of the 910 array), followed by 0b001_00000 (major type of 1 for the first 911 integer, additional information of 0 for the value of -1), followed 912 by 0b000_00011 (major type of 0 for the second integer, additional 913 information of 3 for the value of 3). In hexadecimal: 915 C5 -- Tag 5 916 82 -- Array of length 2 917 20 -- -1 918 03 -- 3 920 Decimal fractions and bigfloats provide no representation of 921 Infinity, -Infinity, or NaN; if these are needed in place of a 922 decimal fraction or bigfloat, the IEEE 754 half-precision 923 representations from Section 3.3 can be used. For constrained 924 applications, where there is a choice between representing a specific 925 number as an integer and as a decimal fraction or bigfloat (such as 926 when the exponent is small and non-negative), there is a quality-of- 927 implementation expectation that the integer representation is used 928 directly. 930 3.4.4. 
Content Hints 932 The tags in this section are for content hints that might be used by 933 generic CBOR processors. These content hints do not extend the 934 generic data model. 936 3.4.4.1. Encoded CBOR Data Item 938 Sometimes it is beneficial to carry an embedded CBOR data item that 939 is not meant to be decoded immediately at the time the enclosing data 940 item is being parsed. Tag 24 (CBOR data item) can be used to tag the 941 embedded byte string as a data item encoded in CBOR format. 943 3.4.4.2. Expected Later Encoding for CBOR-to-JSON Converters 945 Tags 21 to 23 indicate that a byte string might require a specific 946 encoding when interoperating with a text-based representation. These 947 tags are useful when an encoder knows that the byte string data it is 948 writing is likely to be later converted to a particular JSON-based 949 usage. That usage specifies that some strings are encoded as base64, 950 base64url, and so on. The encoder uses byte strings instead of doing 951 the encoding itself to reduce the message size, to reduce the code 952 size of the encoder, or both. The encoder does not know whether or 953 not the converter will be generic, and therefore wants to say what it 954 believes is the proper way to convert binary strings to JSON. 956 The data item tagged can be a byte string or any other data item. In 957 the latter case, the tag applies to all of the byte string data items 958 contained in the data item, except for those contained in a nested 959 data item tagged with an expected conversion. 961 These three tag types suggest conversions to three of the base data 962 encodings defined in [RFC4648]. For base64url encoding, padding is 963 not used (see Section 3.2 of RFC 4648); that is, all trailing equals 964 signs ("=") are removed from the base64url-encoded string. Later 965 tags might be defined for other data encodings of RFC 4648 or for 966 other ways to encode binary data in strings. 968 3.4.4.3. 
Encoded Text 970 Some text strings hold data that have formats widely used on the 971 Internet, and sometimes those formats can be validated and presented 972 to the application in appropriate form by the decoder. There are 973 tags for some of these formats. 975 o Tag 32 is for URIs, as defined in [RFC3986]; 977 o Tags 33 and 34 are for base64url- and base64-encoded text strings, 978 as defined in [RFC4648]; 980 o Tag 35 is for regular expressions that are roughly in Perl 981 Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] or a 982 version of the JavaScript regular expression syntax [ECMA262]. 983 (Note that more specific identification may be necessary if the 984 actual version of the specification underlying the regular 985 expression, or more than just the text of the regular expression 986 itself, need to be conveyed.) 988 o Tag 36 is for MIME messages (including all headers), as defined in 989 [RFC2045]; 991 Note that tags 33 and 34 differ from 21 and 22 in that the data is 992 transported in base-encoded form for the former and in raw byte 993 string form for the latter. 995 3.4.5. Self-Describe CBOR 997 In many applications, it will be clear from the context that CBOR is 998 being employed for encoding a data item. For instance, a specific 999 protocol might specify the use of CBOR, or a media type is indicated 1000 that specifies its use. However, there may be applications where 1001 such context information is not available, such as when CBOR data is 1002 stored in a file and disambiguating metadata is not in use. Here, it 1003 may help to have some distinguishing characteristics for the data 1004 itself. 1006 Tag 55799 is defined for this purpose. It does not impart any 1007 special semantics on the data item that follows; that is, the 1008 semantics of a data item tagged with tag 55799 is exactly identical 1009 to the semantics of the data item itself. 
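Because tag 55799 does not change the semantics of the data item it encloses, an encoder can prepend it and a decoder can discard it. The following minimal, non-normative Python sketch derives the initial bytes of the tag from the tag number using the shortest encoding for major type 6. The names tag_head, add_self_describe, and strip_self_describe are hypothetical.

```python
SELF_DESCRIBE_TAG = 55799


def tag_head(tag):
    """Shortest initial bytes for a tag number (major type 6)."""
    if tag < 0:
        raise ValueError("tag numbers are non-negative")
    if tag < 24:
        return bytes([0xC0 | tag])
    # Additional information 24..27 selects 1, 2, 4, or 8 following bytes.
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if tag < 1 << (8 * size):
            return bytes([0xC0 | info]) + tag.to_bytes(size, "big")
    raise ValueError("tag number does not fit in 64 bits")


def add_self_describe(encoded_item):
    """Prefix an already encoded CBOR data item with tag 55799."""
    return tag_head(SELF_DESCRIBE_TAG) + encoded_item


def strip_self_describe(data):
    """Remove a leading tag 55799 (in its shortest encoding), if present."""
    prefix = tag_head(SELF_DESCRIBE_TAG)
    return data[len(prefix):] if data.startswith(prefix) else data
```

A file reader could, for instance, call strip_self_describe on the file contents before handing the remaining bytes to a CBOR decoder that does not know the tag.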
1011 The serialization of this tag is 0xd9d9f7, which appears not to be in 1012 use as a distinguishing mark for frequently used file types. In 1013 particular, it is not a valid start of a Unicode text in any Unicode 1014 encoding if followed by a valid CBOR data item. 1016 For instance, a decoder might be able to parse both CBOR and JSON. 1017 Such a decoder would need to mechanically distinguish the two 1018 formats. An easy way for an encoder to help the decoder would be to 1019 tag the entire CBOR item with tag 55799, the serialization of which 1020 will never be found at the beginning of a JSON text. 1022 4. Creating CBOR-Based Protocols 1024 Data formats such as CBOR are often used in environments where there 1025 is no format negotiation. A specific design goal of CBOR is to not 1026 need any included or assumed schema: a decoder can take a CBOR item 1027 and decode it with no other knowledge. 1029 Of course, in real-world implementations, the encoder and the decoder 1030 will have a shared view of what should be in a CBOR data item. For 1031 example, an agreed-to format might be "the item is an array whose 1032 first value is a UTF-8 string, second value is an integer, and 1033 subsequent values are zero or more floating-point numbers" or "the 1034 item is a map that has byte strings for keys and contains at least 1035 one pair whose key is 0xab01". 1037 This specification puts no restrictions on CBOR-based protocols. An 1038 encoder can be capable of encoding as many or as few types of values 1039 as is required by the protocol in which it is used; a decoder can be 1040 capable of understanding as many or as few types of values as is 1041 required by the protocols in which it is used. This lack of 1042 restrictions allows CBOR to be used in extremely constrained 1043 environments. 1045 This section discusses some considerations in creating CBOR-based 1046 protocols. 
It is advisory only and explicitly excludes any language 1047 from RFC 2119 other than words that could be interpreted as "MAY" in 1048 the sense of RFC 2119. 1050 4.1. CBOR in Streaming Applications 1052 In a streaming application, a data stream may be composed of a 1053 sequence of CBOR data items concatenated back-to-back. In such an 1054 environment, the decoder immediately begins decoding a new data item 1055 if data is found after the end of a previous data item. 1057 Not all of the bytes making up a data item may be immediately 1058 available to the decoder; some decoders will buffer additional data 1059 until a complete data item can be presented to the application. 1060 Other decoders can present partial information about a top-level data 1061 item to an application, such as the nested data items that could 1062 already be decoded, or even parts of a byte string that hasn't 1063 completely arrived yet. 1065 Note that some applications and protocols will not want to use 1066 indefinite-length encoding. Using indefinite-length encoding allows 1067 an encoder to not need to marshal all the data for counting, but it 1068 requires a decoder to allocate increasing amounts of memory while 1069 waiting for the end of the item. This might be fine for some 1070 applications but not others. 1072 4.2. Generic Encoders and Decoders 1074 A generic CBOR decoder can decode all well-formed CBOR data and 1075 present them to an application. CBOR data is well-formed if it uses 1076 the initial bytes, as well as the byte strings and/or data items that 1077 are implied by their values, in the manner defined by CBOR, and no 1078 extraneous data follows (Appendix C). 1080 Even though CBOR attempts to minimize these cases, not all well- 1081 formed CBOR data is valid: for example, the format excludes simple 1082 values below 32 that are encoded with an extension byte. 
Also, 1083 specific tags may impose semantic constraints that may be violated, 1084 such as by enclosing another tag in a bignum tag or by placing a byte 1085 string within a date tag. Finally, the data itself may be invalid, for example because it contains 1086 invalid UTF-8 strings or date strings that do not conform to 1087 [RFC3339]. There is no requirement that generic encoders and 1088 decoders make unnatural choices for their application interface to 1089 enable the processing of invalid data. Generic encoders and decoders 1090 are expected to forward simple values and tags even if their specific 1091 codepoints are not registered at the time the encoder/decoder is 1092 written (Section 4.5). 1094 Generic decoders provide ways to present well-formed CBOR values, 1095 both valid and invalid, to an application. The diagnostic notation 1096 (Section 7) may be used to present well-formed CBOR values to humans. 1098 Generic encoders provide an application interface that allows the 1099 application to specify any well-formed value, including simple values 1100 and tags unknown to the encoder. 1102 4.3. Syntax Errors 1104 A decoder encountering a CBOR data item that is not well-formed 1105 generally can choose to completely fail the decoding (issue an error 1106 and/or stop processing altogether), substitute the problematic data 1107 and data items using a decoder-specific convention that clearly 1108 indicates there has been a problem, or take some other action. 1110 4.3.1. Incomplete CBOR Data Items 1112 The representation of a CBOR data item has a specific length, 1113 determined by its initial bytes and by the structure of any data 1114 items enclosed in the data item. If less data is available, this 1115 can be treated as a syntax error.
A decoder may also implement 1116 incremental parsing, that is, decode the data item as far as it is 1117 available and present the data found so far (such as in an event- 1118 based interface), with the option of continuing the decoding once 1119 further data is available. 1121 Examples of incomplete data items include: 1123 o A decoder expects a certain number of array or map entries but 1124 instead encounters the end of the data. 1126 o A decoder processes what it expects to be the last pair in a map 1127 and comes to the end of the data. 1129 o A decoder has just seen a tag and then encounters the end of the 1130 data. 1132 o A decoder has seen the beginning of an indefinite-length item but 1133 encounters the end of the data before it sees the "break" stop 1134 code. 1136 4.3.2. Malformed Indefinite-Length Items 1138 Examples of malformed indefinite-length data items include: 1140 o Within an indefinite-length byte string or text, a decoder finds 1141 an item that is not of the appropriate major type before it finds 1142 the "break" stop code. 1144 o Within an indefinite-length map, a decoder encounters the "break" 1145 stop code immediately after reading a key (the value is missing). 1147 Another error is finding a "break" stop code at a point in the data 1148 where there is no immediately enclosing (unclosed) indefinite-length 1149 item. 1151 4.3.3. Unknown Additional Information Values 1153 At the time of writing, some additional information values are 1154 unassigned and reserved for future versions of this document (see 1155 Section 6.2). Since the overall syntax for these additional 1156 information values is not yet defined, a decoder that sees an 1157 additional information value that it does not understand cannot 1158 continue parsing. 1160 4.4. Other Decoding Errors 1162 A CBOR data item may be syntactically well-formed but present a 1163 problem with interpreting the data encoded in it in the CBOR data 1164 model. 
Generally speaking, a decoder that finds a data item with 1165 such a problem might issue a warning, might stop processing 1166 altogether, might handle the error and make the problematic value 1167 available to the application as such, or take some other type of 1168 action. 1170 Such problems might include: 1172 Duplicate keys in a map: Generic decoders (Section 4.2) make data 1173 available to applications using the native CBOR data model. That 1174 data model includes maps (key-value mappings with unique keys), 1175 not multimaps (key-value mappings where multiple entries can have 1176 the same key). Thus, a generic decoder that gets a CBOR map item 1177 that has duplicate keys will decode to a map with only one 1178 instance of that key, or it might stop processing altogether. On 1179 the other hand, a "streaming decoder" may not even be able to 1180 notice (Section 4.7). 1182 Inadmissible type on the value following a tag: Tags (Section 3.4) 1183 specify what type of data item is supposed to follow the tag; for 1184 example, the tags for positive or negative bignums are supposed to 1185 be put on byte strings. A decoder that decodes the tagged data 1186 item into a native representation (a native big integer in this 1187 example) is expected to check the type of the data item being 1188 tagged. Even decoders that don't have such native representations 1189 available in their environment may perform the check on those tags 1190 known to them and react appropriately. 1192 Invalid UTF-8 string: A decoder might or might not want to verify 1193 that the sequence of bytes in a UTF-8 string (major type 3) is 1194 actually valid UTF-8 and react appropriately. 1196 4.5. 
Handling Unknown Simple Values and Tags 1198 A decoder that comes across a simple value (Section 3.3) that it does 1199 not recognize, such as a value that was added to the IANA registry 1200 after the decoder was deployed or a value that the decoder chose not 1201 to implement, might issue a warning, might stop processing 1202 altogether, might handle the error by making the unknown value 1203 available to the application as such (as is expected of generic 1204 decoders), or take some other type of action. 1206 A decoder that comes across a tag (Section 3.4) that it does not 1207 recognize, such as a tag that was added to the IANA registry after 1208 the decoder was deployed or a tag that the decoder chose not to 1209 implement, might issue a warning, might stop processing altogether, 1210 might handle the error and present the unknown tag value together 1211 with the contained data item to the application (as is expected of 1212 generic decoders), might ignore the tag and simply present the 1213 contained data item only to the application, or take some other type 1214 of action. 1216 4.6. Numbers 1218 An application or protocol that uses CBOR might restrict the 1219 representations of numbers. For instance, a protocol that only deals 1220 with integers might say that floating-point numbers may not be used 1221 and that decoders of that protocol do not need to be able to handle 1222 floating-point numbers. Similarly, a protocol or application that 1223 uses CBOR might say that decoders need to be able to handle either 1224 type of number. 1226 CBOR-based protocols should take into account that different language 1227 environments pose different restrictions on the range and precision 1228 of numbers that are representable. For example, the JavaScript 1229 number system treats all numbers as floating point, which may result 1230 in silent loss of precision in decoding integers with more than 53 1231 significant bits. 
A protocol that uses numbers should define its 1232 expectations on the handling of non-trivial numbers in decoders and 1233 receiving applications. 1235 A CBOR-based protocol that includes floating-point numbers can 1236 restrict which of the three formats (half-precision, single- 1237 precision, and double-precision) are to be supported. For an 1238 integer-only application, a protocol may want to completely exclude 1239 the use of floating-point values. 1241 A CBOR-based protocol designed for compactness may want to exclude 1242 specific integer encodings that are longer than necessary for the 1243 application, such as to save the need to implement 64-bit integers. 1245 There is an expectation that encoders will use the most compact 1246 integer representation that can represent a given value. However, a 1247 compact application should accept values that use a longer-than- 1248 needed encoding (such as encoding "0" as 0b000_11001 followed by two 1249 bytes of 0x00) as long as the application can decode an integer of 1250 the given size. 1252 4.7. Specifying Keys for Maps 1254 The encoding and decoding applications need to agree on what types of 1255 keys are going to be used in maps. In applications that need to 1256 interwork with JSON-based applications, keys probably should be 1257 limited to UTF-8 strings only; otherwise, there has to be a specified 1258 mapping from the other CBOR types to Unicode characters, and this 1259 often leads to implementation errors. In applications where keys are 1260 numeric in nature and numeric ordering of keys is important to the 1261 application, directly using the numbers for the keys is useful. 1263 If multiple types of keys are to be used, consideration should be 1264 given to how these types would be represented in the specific 1265 programming environments that are to be used. For example, in 1266 JavaScript Maps [ECMA262], a key of integer 1 cannot be distinguished 1267 from a key of floating point 1.0. 
This means that, if integer keys 1268 are used, the protocol needs to avoid use of floating-point keys the 1269 values of which happen to be integer numbers in the same map. 1271 Decoders that deliver data items nested within a CBOR data item 1272 immediately on decoding them ("streaming decoders") often do not keep 1273 the state that is necessary to ascertain uniqueness of a key in a 1274 map. Similarly, an encoder that can start encoding data items before 1275 the enclosing data item is completely available ("streaming encoder") 1276 may want to reduce its overhead significantly by relying on its data 1277 source to maintain uniqueness. 1279 A CBOR-based protocol should make an intentional decision about what 1280 to do when a receiving application does see multiple identical keys 1281 in a map. The resulting rule in the protocol should respect the CBOR 1282 data model: it cannot prescribe a specific handling of the entries 1283 with the identical keys, except that it might have a rule that having 1284 identical keys in a map indicates a malformed map and that the 1285 decoder has to stop with an error. Duplicate keys are also 1286 prohibited by CBOR decoders that are using strict mode 1287 (Section 4.10). 1289 The CBOR data model for maps does not allow ascribing semantics to 1290 the order of the key/value pairs in the map representation. Thus, a 1291 CBOR-based protocol MUST NOT specify that changing the key/value pair 1292 order in a map would change the semantics, except to specify that 1293 some, e.g. non-canonical, orders are disallowed. Timing, cache 1294 usage, and other side channels are not considered part of the 1295 semantics. 1297 Applications for constrained devices that have maps with 24 or fewer 1298 frequently used keys should consider using small integers (and those 1299 with up to 48 frequently used keys should consider also using small 1300 negative integers) because the keys can then be encoded in a single 1301 byte. 1303 4.7.1. 
Equivalence of Keys 1305 The following notion of equivalence must be used to determine whether keys in 1306 maps are duplicates or distinct. 1308 o All numbers are compared by their numeric value. 1310 * Integer data items with the same value are equal regardless of 1311 how many bytes are used to encode them. 1313 * Floating-point data items with the same value are equal 1314 regardless of how many bytes are used to encode them. 1316 * An integer value encoded as a floating-point data item is 1317 equivalent to the same value encoded as an integer. 1319 o Byte strings and text strings are compared by their binary 1320 content. 1322 * A different length encoding has no effect on equivalence. 1324 * A byte string is equal to a text string if they have the same 1325 binary content. 1327 o Two arrays are equal if they contain equal items in the same 1328 order. 1330 o Two maps are equal if they have the same set of pairs regardless 1331 of their order; pairs are equal if both the key and value are 1332 equal. 1334 o Tags have no effect in determining equality of a data item; if two 1335 items are equal, then they are equal irrespective of any tags that 1336 either or both may have. 1338 o Simple values are equal if they have the same value. 1340 Nothing else is equal: a simple value 2 is not equivalent to an 1341 integer 2, and an array cannot be equivalent to a map with the same 1342 values and sequential integer keys. 1344 4.8. Undefined Values 1346 In some CBOR-based protocols, the simple value (Section 3.3) of 1347 Undefined might be used by an encoder as a substitute for a data item 1348 with an encoding problem, in order to allow the rest of the enclosing 1349 data items to be encoded without harm. 1351 4.9. Canonical CBOR 1353 Some protocols may want encoders to only emit CBOR in a particular 1354 canonical format; those protocols might also have the decoders check 1355 that their input is canonical.
   Those protocols are free to define
   what they mean by a canonical format and what encoders and decoders
   are expected to do. This section defines a set of restrictions that
   can serve as the base of such a canonical format.

   A CBOR encoding satisfies the "core canonicalization requirements" if
   it satisfies the following restrictions:

   o  Integers MUST be as short as possible. In particular:

      *  0 to 23 and -1 to -24 MUST be expressed in the same byte as the
         major type;

      *  24 to 255 and -25 to -256 MUST be expressed only with an
         additional uint8_t;

      *  256 to 65535 and -257 to -65536 MUST be expressed only with an
         additional uint16_t;

      *  65536 to 4294967295 and -65537 to -4294967296 MUST be expressed
         only with an additional uint32_t.

   o  The expression of lengths in major types 2 through 5 MUST be as
      short as possible. The rules for these lengths follow the above
      rule for integers.

   o  The keys in every map MUST be sorted in the bytewise lexicographic
      order of their canonical encodings. For example, the following
      keys are sorted correctly:

      1.  10, encoded as 0x0a.

      2.  100, encoded as 0x1864.

      3.  -1, encoded as 0x20.

      4.  "z", encoded as 0x617a.

      5.  "aa", encoded as 0x626161.

      6.  [100], encoded as 0x811864.

      7.  [-1], encoded as 0x8120.

      8.  false, encoded as 0xf4.

   o  Indefinite-length items MUST NOT appear. They can be encoded as
      definite-length items instead.

   If a protocol allows for IEEE floats, then additional
   canonicalization rules might need to be added. One example rule
   might be to have all floats start as a 64-bit float, then do a test
   conversion to a 32-bit float; if the result is the same numeric
   value, use the shorter value and repeat the process with a test
   conversion to a 16-bit float. (This rule selects 16-bit float for
   positive and negative Infinity as well.)
   Also, there are many
   representations for NaN. If NaN is an allowed value, it must always
   be represented as 0xf97e00.

   CBOR tags present additional considerations for canonicalization.
   The absence or presence of tags in a canonical format is determined
   by the optionality of the tags in the protocol. In a CBOR-based
   protocol that allows optional tagging anywhere, the canonical format
   must not allow them. In a protocol that requires tags in certain
   places, the tag needs to appear in the canonical format. A
   CBOR-based protocol that uses canonicalization might instead say that
   all tags that appear in a message must be retained regardless of
   whether they are optional.

   Protocols that include floating, big integer, or other complex values
   need to define extra requirements on their canonical encodings. For
   example:

   o  If a protocol includes a field that can express floating values
      (Section 3.3), the protocol's canonicalization needs to specify
      whether the integer 1.0 is encoded as 0x01, 0xf93c00,
      0xfa3f800000, or 0xfb3ff0000000000000. Three sensible rules for
      this are:

      1.  Encode integral values that fit in 64 bits as values from
          major types 0 and 1, and other values as the smallest of 16-,
          32-, or 64-bit floating point that accurately represents the
          value,

      2.  Encode all values as the smallest of 16-, 32-, or 64-bit
          floating point that accurately represents the value, even for
          integral values, or

      3.  Encode all values as 64-bit floating point.

      If NaN is an allowed value, the protocol needs to pick a single
      representation, for example 0xf97e00.

   o  If a protocol includes a field that can express integers larger
      than 2^64 using tag 2 (Section 3.4.2), the protocol's
      canonicalization needs to specify whether small integers are
      expressed using the tag or major types 0 and 1.

   o  A protocol might give encoders the choice of representing a URL as
      either a text string or, using Section 3.4.4.3, tag 32 containing
      a text string. This protocol's canonicalization needs to either
      require that the tag is present or require that it's absent, not
      allow either one.

4.9.1. Length-first map key ordering

   The core canonicalization requirements sort map keys in a different
   order from the one suggested by [RFC7049]. Protocols that need to be
   compatible with [RFC7049]'s order can instead be specified in terms
   of this specification's "length-first core canonicalization
   requirements":

   A CBOR encoding satisfies the "length-first core canonicalization
   requirements" if it satisfies the core canonicalization requirements
   except that the keys in every map MUST be sorted such that:

   1.  If two keys have different lengths, the shorter one sorts
       earlier;

   2.  If two keys have the same length, the one with the lower value in
       (byte-wise) lexical order sorts earlier.

   For example, under the length-first core canonicalization
   requirements, the following keys are sorted correctly:

   1.  10, encoded as 0x0a.

   2.  -1, encoded as 0x20.

   3.  false, encoded as 0xf4.

   4.  100, encoded as 0x1864.

   5.  "z", encoded as 0x617a.

   6.  [-1], encoded as 0x8120.

   7.  "aa", encoded as 0x626161.

   8.  [100], encoded as 0x811864.

4.10. Strict Mode

   Some areas of application of CBOR do not require canonicalization
   (Section 4.9) but may require that different decoders reach the same
   (semantically equivalent) results, even in the presence of
   potentially malicious data. This can be required if one application
   (such as a firewall or other protecting entity) makes a decision
   based on the data that another application, which independently
   decodes the data, relies on.
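
   As an aside on the two map key orderings of Section 4.9 and
   Section 4.9.1, the following Python sketch sorts the encoded example
   keys both ways. It assumes that every key has already been encoded
   canonically into a byte string; the names ENCODED_KEYS,
   sort_bytewise, and sort_length_first are hypothetical and not part
   of this specification.

```python
# Encoded example keys from Sections 4.9 and 4.9.1, deliberately unsorted.
ENCODED_KEYS = [
    bytes.fromhex("f4"),      # false
    bytes.fromhex("626161"),  # "aa"
    bytes.fromhex("1864"),    # 100
    bytes.fromhex("8120"),    # [-1]
    bytes.fromhex("0a"),      # 10
    bytes.fromhex("811864"),  # [100]
    bytes.fromhex("20"),      # -1
    bytes.fromhex("617a"),    # "z"
]


def sort_bytewise(keys):
    """Core canonicalization order: bytewise lexicographic."""
    # Python compares bytes objects byte by byte, which is the order needed.
    return sorted(keys)


def sort_length_first(keys):
    """Length-first order: shorter encodings first, then bytewise."""
    return sorted(keys, key=lambda k: (len(k), k))


if __name__ == "__main__":
    print([k.hex() for k in sort_bytewise(ENCODED_KEYS)])
    print([k.hex() for k in sort_length_first(ENCODED_KEYS)])
```

   Run as a script, the first list follows the order of the Section 4.9
   example (0a, 1864, 20, ...) and the second follows the Section 4.9.1
   example (0a, 20, f4, ...).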

   Normally, it is the responsibility of the sender to avoid ambiguously
   decodable data. However, the sender might be an attacker specially
   making up CBOR data such that it will be interpreted differently by
   different decoders in an attempt to exploit that as a vulnerability.
   Generic decoders used in applications where this might be a problem
   need to support a strict mode in which it is also the responsibility
   of the receiver to reject ambiguously decodable data. It is expected
   that firewalls and other security systems that decode CBOR will only
   decode in strict mode.

   A decoder in strict mode will reliably reject any data that could be
   interpreted by other decoders in different ways. It will reliably
   reject data items with syntax errors (Section 4.3). It will also
   expend the effort to reliably detect other decoding errors
   (Section 4.4). In particular, a strict decoder needs to have an API
   that reports an error (and does not return data) for a CBOR data item
   that contains any of the following:

   o  a map (major type 5) that has more than one entry with the same
      key

   o  a tag that is used on a data item of the incorrect type

   o  a data item that is incorrectly formatted for the type given to
      it, such as invalid UTF-8 or data that cannot be interpreted with
      the specific tag that it has been tagged with

   A decoder in strict mode can do one of two things when it encounters
   a tag or simple value that it does not recognize:

   o  It can report an error (and not return data).

   o  It can emit the unknown item (type, value, and, for tags, the
      decoded tagged data item) to the application calling the decoder
      with an indication that the decoder did not recognize that tag or
      simple value.

   The latter approach, which is also appropriate for non-strict
   decoders, supports forward compatibility with newly registered tags
   and simple values without the requirement to update the encoder at
   the same time as the calling application. (For this, the API for the
   decoder needs to have a way to mark unknown items so that the calling
   application can handle them in a manner appropriate for the program.)

   Since some of this processing may have an appreciable cost (in
   particular with duplicate detection for maps), support of strict mode
   is not a requirement placed on all CBOR decoders.

   Some encoders will rely on their applications to provide input data
   in such a way that unambiguously decodable CBOR results. A generic
   encoder also may want to provide a strict mode where it reliably
   limits its output to unambiguously decodable CBOR, independent of
   whether or not its application is providing API-conformant data.

5. Converting Data between CBOR and JSON

   This section gives non-normative advice about converting between CBOR
   and JSON. Implementations of converters are free to use whichever
   advice here they want.

   It is worth noting that a JSON text is a sequence of characters, not
   an encoded sequence of bytes, while a CBOR data item consists of
   bytes, not characters.

5.1. Converting from CBOR to JSON

   Most of the types in CBOR have direct analogs in JSON. However, some
   do not, and someone implementing a CBOR-to-JSON converter has to
   consider what to do in those cases. The following non-normative
   advice deals with these by converting them to a single substitute
   value, such as a JSON null.

   o  An integer (major type 0 or 1) becomes a JSON number.

   o  A byte string (major type 2) that is not embedded in a tag that
      specifies a proposed encoding is encoded in base64url without
      padding and becomes a JSON string.

   o  A UTF-8 string (major type 3) becomes a JSON string. Note that
      JSON requires escaping certain characters ([RFC8259], Section 7):
      quotation mark (U+0022), reverse solidus (U+005C), and the "C0
      control characters" (U+0000 through U+001F). All other characters
      are copied unchanged into the JSON UTF-8 string.

   o  An array (major type 4) becomes a JSON array.

   o  A map (major type 5) becomes a JSON object. This is possible
      directly only if all keys are UTF-8 strings. A converter might
      also convert other keys into UTF-8 strings (such as by converting
      integers into strings containing their decimal representation);
      however, doing so introduces a danger of key collision.

   o  False (major type 7, additional information 20) becomes a JSON
      false.

   o  True (major type 7, additional information 21) becomes a JSON
      true.

   o  Null (major type 7, additional information 22) becomes a JSON
      null.

   o  A floating-point value (major type 7, additional information 25
      through 27) becomes a JSON number if it is finite (that is, it can
      be represented in a JSON number); if the value is non-finite (NaN,
      or positive or negative Infinity), it is represented by the
      substitute value.

   o  Any other simple value (major type 7, any additional information
      value not yet discussed) is represented by the substitute value.

   o  A bignum (major type 6, tag value 2 or 3) is represented by
      encoding its byte string in base64url without padding and becomes
      a JSON string. For tag value 3 (negative bignum), a "~" (ASCII
      tilde) is inserted before the base-encoded value. (The conversion
      to a binary blob instead of a number is to prevent a likely
      numeric overflow for the JSON decoder.)

   o  A byte string with an encoding hint (major type 6, tag value 21
      through 23) is encoded as described and becomes a JSON string.

   o  For all other tags (major type 6, any other tag value), the
      embedded CBOR item is represented as a JSON value; the tag value
      is ignored.

   o  Indefinite-length items are made definite before conversion.

5.2. Converting from JSON to CBOR

   All JSON values, once decoded, directly map into one or more CBOR
   values. As with any kind of CBOR generation, decisions have to be
   made with respect to number representation. In a suggested
   conversion:

   o  JSON numbers without fractional parts (integer numbers) are
      represented as integers (major types 0 and 1, possibly major type
      6 tag value 2 and 3), choosing the shortest form; integers longer
      than an implementation-defined threshold (which is usually either
      32 or 64 bits) may instead be represented as floating-point
      values. (If the JSON was generated from a JavaScript
      implementation, its precision is already limited to 53 bits
      maximum.)

   o  Numbers with fractional parts are represented as floating-point
      values. Preferably, the shortest exact floating-point
      representation is used; for instance, 1.5 is represented in a
      16-bit floating-point value (not all implementations will be
      capable of efficiently finding the minimum form, though). There
      may be an implementation-defined limit to the precision that will
      affect the precision of the represented values. Decimal
      representation should only be used if that is specified in a
      protocol.

   CBOR has been designed to generally provide a more compact encoding
   than JSON. One implementation strategy that might come to mind is to
   perform a JSON-to-CBOR encoding in place in a single buffer. This
   strategy would need to carefully consider a number of pathological
   cases, such as that some strings represented with no or very few
   escapes and longer (or much longer) than 255 bytes may expand when
   encoded as UTF-8 strings in CBOR.
   Similarly, a few of the binary
   floating-point representations might cause expansion from some short
   decimal representations (1.1, 1e9) in JSON. This may be hard to get
   right, and any ensuing vulnerabilities may be exploited by an
   attacker.

6. Future Evolution of CBOR

   Successful protocols evolve over time. New ideas appear,
   implementation platforms improve, related protocols are developed and
   evolve, and new requirements from applications and protocols are
   added. Facilitating protocol evolution is therefore an important
   design consideration for any protocol development.

   For protocols that will use CBOR, CBOR provides some useful
   mechanisms to facilitate their evolution. Best practices for this
   are well known, particularly from JSON format development of
   JSON-based protocols. Therefore, such best practices are outside the
   scope of this specification.

   However, facilitating the evolution of CBOR itself is very well
   within its scope. CBOR is designed to both provide a stable basis
   for development of CBOR-based protocols and to be able to evolve.
   Since a successful protocol may live for decades, CBOR needs to be
   designed for decades of use and evolution. This section provides
   some guidance for the evolution of CBOR. It is necessarily more
   subjective than other parts of this document. It is also necessarily
   incomplete, lest it turn into a textbook on protocol development.

6.1. Extension Points

   In a protocol design, opportunities for evolution are often included
   in the form of extension points. For example, there may be a
   codepoint space that is not fully allocated from the outset, and the
   protocol is designed to tolerate and embrace implementations that
   start using more codepoints than initially allocated.

   Sizing the codepoint space may be difficult because the range
   required may be hard to predict.
   An attempt should be made to make
   the codepoint space large enough so that it can slowly be filled over
   the intended lifetime of the protocol.

   CBOR has three major extension points:

   o  the "simple" space (values in major type 7). Of the 24 efficient
      (and 224 slightly less efficient) values, only a small number have
      been allocated. Implementations receiving an unknown simple data
      item may be able to process it as such, given that the structure
      of the value is indeed simple. The IANA registry in Section 8.1
      is the appropriate way to address the extensibility of this
      codepoint space.

   o  the "tag" space (values in major type 6). Again, only a small
      part of the codepoint space has been allocated, and the space is
      abundant (although the early numbers are more efficient than the
      later ones). Implementations receiving an unknown tag can choose
      to simply ignore it or to process it as an unknown tag wrapping
      the following data item. The IANA registry in Section 8.2 is the
      appropriate way to address the extensibility of this codepoint
      space.

   o  the "additional information" space. An implementation receiving
      an unknown additional information value has no way to continue
      parsing, so allocating codepoints to this space is a major step.
      There are also very few codepoints left.

6.2. Curating the Additional Information Space

   The human mind is sometimes drawn to filling in little perceived gaps
   to make something neat. We expect the remaining gaps in the
   codepoint space for the additional information values to be an
   attractor for new ideas, just because they are there.

   The present specification does not manage the additional information
   codepoint space by an IANA registry. Instead, allocations out of
   this space can only be done by updating this specification.
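
   Why an unknown additional information value stops a parser can be
   seen in how a decoder handles the initial byte of a data item. The
   following Python sketch is illustrative only; the helper names
   split_initial_byte and argument_size are hypothetical. It splits an
   initial byte into major type and additional information and reports
   how many bytes follow to carry the argument. It does not check
   whether the value 31 is permitted for the major type at hand.

```python
def split_initial_byte(initial_byte):
    """Return (major type, additional information) for an initial byte."""
    return initial_byte >> 5, initial_byte & 0x1F


def argument_size(additional_info):
    """Number of bytes following the initial byte that carry the argument."""
    if additional_info < 24:
        return 0  # the value is held in the initial byte itself
    if additional_info <= 27:
        return 1 << (additional_info - 24)  # 1, 2, 4, or 8 bytes
    if additional_info == 31:
        return 0  # indefinite length or "break"; nothing follows the byte
    # 28 to 30 are not allocated: the decoder cannot tell how much data
    # follows, so it has no way to continue parsing.
    raise ValueError(f"unassigned additional information {additional_info}")


if __name__ == "__main__":
    major_type, info = split_initial_byte(0xFB)  # double-precision float
    print(major_type, info, argument_size(info))
```

   For the initial byte 0xfb the sketch reports major type 7,
   additional information 27, and 8 following bytes.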

   For an additional information value of n >= 24, the size of the
   additional data typically is 2**(n-24) bytes. Therefore, additional
   information values 28 and 29 should be viewed as candidates for
   128-bit and 256-bit quantities, in case a need arises to add them to
   the protocol. Additional information value 30 is then the only
   additional information value available for general allocation, and
   there should be a very good reason for allocating it before assigning
   it through an update of this protocol.

7. Diagnostic Notation

   CBOR is a binary interchange format. To facilitate documentation and
   debugging, and in particular to facilitate communication between
   entities cooperating in debugging, this section defines a simple
   human-readable diagnostic notation. All actual interchange always
   happens in the binary format.

   Note that this truly is a diagnostic format; it is not meant to be
   parsed. Therefore, no formal definition (as in ABNF) is given in
   this document. (Implementers looking for a text-based format for
   representing CBOR data items in configuration files may also want to
   consider YAML [YAML].)

   The diagnostic notation is loosely based on JSON as it is defined in
   RFC 8259, extending it where needed.

   The notation borrows the JSON syntax for numbers (integer and
   floating point), True (>true<), False (>false<), Null (>null<), UTF-8
   strings, arrays, and maps (maps are called objects in JSON; the
   diagnostic notation extends JSON here by allowing any data item in
   the key position). Undefined is written >undefined< as in
   JavaScript. The non-finite floating-point numbers Infinity,
   -Infinity, and NaN are written exactly as in this sentence (this is
   also a way they can be written in JavaScript, although JSON does not
   allow them).
   A tagged item is written as an integer number for the
   tag followed by the item in parentheses; for instance, an RFC 3339
   (ISO 8601) date could be notated as:

      0("2013-03-21T20:04:00Z")

   or the equivalent relative time as

      1(1363896240)

   Byte strings are notated in one of the base encodings, without
   padding, enclosed in single quotes, prefixed by >h< for base16, >b32<
   for base32, >h32< for base32hex, >b64< for base64 or base64url (the
   actual encodings do not overlap, so the string remains unambiguous).
   For example, the byte string 0x12345678 could be written h'12345678',
   b32'CI2FM6A', or b64'EjRWeA'.

   Unassigned simple values are given as "simple()" with the appropriate
   integer in the parentheses. For example, "simple(42)" indicates
   major type 7, value 42.

7.1. Encoding Indicators

   Sometimes it is useful to indicate in the diagnostic notation which
   of several alternative representations was actually used; for
   example, a data item written >1.5< by a diagnostic decoder might have
   been encoded as a half-, single-, or double-precision float.

   The convention for encoding indicators is that anything starting with
   an underscore, together with all following characters that are
   alphanumeric or underscore, is an encoding indicator and can be
   ignored by anyone not interested in this information. Encoding
   indicators are always optional.

   A single underscore can be written after the opening brace of a map
   or the opening bracket of an array to indicate that the data item was
   represented in indefinite-length format. For example, [_ 1, 2]
   contains an indicator that an indefinite-length representation was
   used to represent the data item [1, 2].
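
   To make the indicator concrete, the following Python sketch encodes
   an array in both definite-length and indefinite-length form and
   prints the matching diagnostic notation. It is deliberately limited
   to arrays of at most 23 items, each an integer from 0 to 23, so that
   every item and the array head fit in a single byte. The function
   names are hypothetical and not part of this specification.

```python
def encode_small_uint_array(items, indefinite=False):
    """Encode a short list of integers 0..23 as a CBOR array (major type 4)."""
    if len(items) > 23 or any(not 0 <= i <= 23 for i in items):
        raise ValueError("sketch handles only short arrays of integers 0..23")
    body = bytes(items)  # integers 0..23 are encoded as the byte itself
    if indefinite:
        # 0x9f opens an indefinite-length array, 0xff ("break") closes it.
        return b"\x9f" + body + b"\xff"
    # Definite length: the item count sits in the additional information.
    return bytes([0x80 | len(items)]) + body


def diagnostic(items, indefinite=False):
    """Diagnostic notation, with the "_" indicator for indefinite length."""
    inner = ", ".join(str(i) for i in items)
    return f"[_ {inner}]" if indefinite else f"[{inner}]"


if __name__ == "__main__":
    for flag in (False, True):
        encoded = encode_small_uint_array([1, 2], flag)
        print(diagnostic([1, 2], flag), "0x" + encoded.hex())
```

   Run as a script, this prints "[1, 2] 0x820102" followed by
   "[_ 1, 2] 0x9f0102ff": the same data item in two representations,
   distinguished in the notation only by the indicator.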

   An underscore followed by a decimal digit n indicates that the
   preceding item (or, for arrays and maps, the item starting with the
   preceding bracket or brace) was encoded with an additional
   information value of 24+n. For example, 1.5_1 is a half-precision
   floating-point number, while 1.5_3 is encoded as double precision.
   This encoding indicator is not shown in Appendix A. (Note that the
   encoding indicator "_" is thus an abbreviation of the full form "_7",
   which is not used.)

   As a special case, byte and text strings of indefinite length can be
   notated in the form (_ h'0123', h'4567') and (_ "foo", "bar").

8. IANA Considerations

   IANA has created two registries for new CBOR values. The registries
   are separate, that is, not under an umbrella registry, and follow the
   rules in [RFC8126]. IANA has also assigned a new MIME media type and
   an associated Constrained Application Protocol (CoAP) Content-Format
   entry.

8.1. Simple Values Registry

   IANA has created the "Concise Binary Object Representation (CBOR)
   Simple Values" registry. The initial values are shown in Table 2.

   New entries in the range 0 to 19 are assigned by Standards Action.
   It is suggested that these Standards Actions allocate values starting
   with the number 16 in order to reserve the lower numbers for
   contiguous blocks (if any).

   New entries in the range 32 to 255 are assigned by Specification
   Required.

8.2. Tags Registry

   IANA has created the "Concise Binary Object Representation (CBOR)
   Tags" registry. The initial values are shown in Table 3.

   New entries in the range 0 to 23 are assigned by Standards Action.
   New entries in the range 24 to 255 are assigned by Specification
   Required. New entries in the range 256 to 18446744073709551615 are
   assigned by First Come First Served.
   The template for registration
   requests is:

   o  Data item

   o  Semantics (short form)

   In addition, First Come First Served requests should include:

   o  Point of contact

   o  Description of semantics (URL) - This description is optional; the
      URL can point to something like an Internet-Draft or a web page.

8.3. Media Type ("MIME Type")

   The Internet media type [RFC6838] for CBOR data is application/cbor.

   Type name: application

   Subtype name: cbor

   Required parameters: n/a

   Optional parameters: n/a

   Encoding considerations: binary

   Security considerations: See Section 9 of this document

   Interoperability considerations: n/a

   Published specification: This document

   Applications that use this media type: None yet, but it is expected
   that this format will be deployed in protocols and applications.

   Additional information:
      Magic number(s): n/a
      File extension(s): .cbor
      Macintosh file type code(s): n/a

   Person & email address to contact for further information:
      Carsten Bormann
      cabo@tzi.org

   Intended usage: COMMON

   Restrictions on usage: none

   Author:
      Carsten Bormann

   Change controller:
      The IESG

8.4. CoAP Content-Format

   Media Type: application/cbor

   Encoding: -

   Id: 60

   Reference: [RFCthis]

8.5. The +cbor Structured Syntax Suffix Registration

   Name: Concise Binary Object Representation (CBOR)

   +suffix: +cbor

   References: [RFCthis]

   Encoding Considerations: CBOR is a binary format.

   Interoperability Considerations: n/a

   Fragment Identifier Considerations:
      The syntax and semantics of fragment identifiers specified for
      +cbor SHOULD be as specified for "application/cbor". (At
      publication of this document, there is no fragment identification
      syntax defined for "application/cbor".)

      The syntax and semantics for fragment identifiers for a specific
      "xxx/yyy+cbor" SHOULD be processed as follows:

         For cases defined in +cbor, where the fragment identifier
         resolves per the +cbor rules, then process as specified in
         +cbor.

         For cases defined in +cbor, where the fragment identifier does
         not resolve per the +cbor rules, then process as specified in
         "xxx/yyy+cbor".

         For cases not defined in +cbor, then process as specified in
         "xxx/yyy+cbor".

   Security Considerations: See Section 9 of this document

   Contact:
      Apps Area Working Group (apps-discuss@ietf.org)

   Author/Change Controller:
      The Apps Area Working Group.
      The IESG has change control over this registration.

9. Security Considerations

   A network-facing application can exhibit vulnerabilities in its
   processing logic for incoming data. Complex parsers are well known
   as a likely source of such vulnerabilities, such as the ability to
   remotely crash a node, or even remotely execute arbitrary code on it.
   CBOR attempts to narrow the opportunities for introducing such
   vulnerabilities by reducing parser complexity, by giving the entire
   range of encodable values a meaning where possible.

   Resource exhaustion attacks might attempt to lure a decoder into
   allocating very big data items (strings, arrays, maps) or exhaust the
   stack depth by setting up deeply nested items. Decoders need to have
   appropriate resource management to mitigate these attacks. (Items
   for which very large sizes are given can also attempt to exploit
   integer overflow vulnerabilities.)

   Applications where a CBOR data item is examined by a gatekeeper
   function and later used by a different application may exhibit
   vulnerabilities when multiple interpretations of the data item are
   possible.
   For example, an attacker could make use of duplicate keys
   in maps and precision issues in numbers to make the gatekeeper base
   its decisions on a different interpretation than the one that will be
   used by the second application. Protocols that are used in a
   security context should be defined in such a way that these multiple
   interpretations are reliably reduced to a single one. To facilitate
   this, encoder and decoder implementations used in such contexts
   should provide at least one strict mode of operation (Section 4.10).

10. Acknowledgements

   CBOR was inspired by MessagePack. MessagePack was developed and
   promoted by Sadayuki Furuhashi ("frsyuki"). This reference to
   MessagePack is solely for attribution; CBOR is not intended as a
   version of or replacement for MessagePack, as it has different design
   goals and requirements.

   The need for functionality beyond the original MessagePack
   Specification became obvious to many people at about the same time
   around the year 2012. BinaryPack is a minor derivation of
   MessagePack that was developed by Eric Zhang for the binaryjs
   project. A similar, but different, extension was made by Tim Caswell
   for his msgpack-js and msgpack-js-browser projects. Many people have
   contributed to the recent discussion about extending MessagePack to
   separate text string representation from byte string representation.

   The encoding of the additional information in CBOR was inspired by
   the encoding of length information designed by Klaus Hartke for CoAP.

   This document also incorporates suggestions made by many people,
   notably Dan Frost, James Manger, Joe Hildebrand, Keith Moore, Matthew
   Lepinski, Nico Williams, Phillip Hallam-Baker, Ray Polk, Tim Bray,
   Tony Finch, Tony Hansen, and Yaron Sheffer.

11. References

11.1. Normative References

   [ECMA262]  Ecma International, "ECMAScript 2018 Language
              Specification", ECMA Standard ECMA-262, 9th Edition, June
              2018.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC4287]  Nottingham, M., Ed. and R. Sayre, Ed., "The Atom
              Syndication Format", RFC 4287, DOI 10.17487/RFC4287,
              December 2005.

   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006.

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017.

   [TIME_T]   The Open Group Base Specifications, "Vol. 1: Base
              Definitions, Issue 7", Section 4.15 'Seconds Since the
              Epoch', IEEE Std 1003.1, 2013 Edition, 2013.

11.2. Informative References

   [ASN.1]    International Telecommunication Union, "Information
              Technology -- ASN.1 encoding rules: Specification of Basic
              Encoding Rules (BER), Canonical Encoding Rules (CER) and
              Distinguished Encoding Rules (DER)", ITU-T Recommendation
              X.690, 1994.

   [BSON]     Various, "BSON - Binary JSON", 2013.

   [MessagePack]
              Furuhashi, S., "MessagePack", 2013.

   [PCRE]     Hazel, P., "PCRE - Perl Compatible Regular Expressions",
              2018.

   [RFC0713]  Haverty, J., "MSDTP-Message Services Data Transmission
              Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013.

   [RFC7049]  Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049,
              October 2013.

   [RFC7228]  Bormann, C., Ersue, M., and A. Keranen, "Terminology for
              Constrained-Node Networks", RFC 7228,
              DOI 10.17487/RFC7228, May 2014.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017.

   [UBJSON]   The Buzz Media, "Universal Binary JSON Specification",
              2013.

   [YAML]     Ben-Kiki, O., Evans, C., and I. Net, "YAML Ain't Markup
              Language (YAML[TM]) Version 1.2", 3rd Edition, October
              2009.

Appendix A. Examples

   The following table provides some CBOR-encoded values in hexadecimal
   (right column), together with diagnostic notation for these values
   (left column). Note that the string "\u00fc" is one form of
   diagnostic notation for a UTF-8 string containing the single Unicode
   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (u umlaut).
   Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a
   single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, often
   representing "water"), and "\ud800\udd51" is a UTF-8 string in
   diagnostic notation with a single character U+10151 (GREEK ACROPHONIC
   ATTIC FIFTY STATERS).
   (Note that all these single-character strings
   could also be represented in native UTF-8 in diagnostic notation,
   just not in an ASCII-only specification like the present one.)  In
   the diagnostic notation provided for bignums, their intended numeric
   value is shown as a decimal number (such as 18446744073709551616)
   instead of showing a tagged byte string (such as
   2(h'010000000000000000')).

   +------------------------------+------------------------------------+
   | Diagnostic                   | Encoded                            |
   +------------------------------+------------------------------------+
   | 0                            | 0x00                               |
   |                              |                                    |
   | 1                            | 0x01                               |
   |                              |                                    |
   | 10                           | 0x0a                               |
   |                              |                                    |
   | 23                           | 0x17                               |
   |                              |                                    |
   | 24                           | 0x1818                             |
   |                              |                                    |
   | 25                           | 0x1819                             |
   |                              |                                    |
   | 100                          | 0x1864                             |
   |                              |                                    |
   | 1000                         | 0x1903e8                           |
   |                              |                                    |
   | 1000000                      | 0x1a000f4240                       |
   |                              |                                    |
   | 1000000000000                | 0x1b000000e8d4a51000               |
   |                              |                                    |
   | 18446744073709551615         | 0x1bffffffffffffffff               |
   |                              |                                    |
   | 18446744073709551616         | 0xc249010000000000000000           |
   |                              |                                    |
   | -18446744073709551616        | 0x3bffffffffffffffff               |
   |                              |                                    |
   | -18446744073709551617        | 0xc349010000000000000000           |
   |                              |                                    |
   | -1                           | 0x20                               |
   |                              |                                    |
   | -10                          | 0x29                               |
   |                              |                                    |
   | -100                         | 0x3863                             |
   |                              |                                    |
   | -1000                        | 0x3903e7                           |
   |                              |                                    |
   | 0.0                          | 0xf90000                           |
   |                              |                                    |
   | -0.0                         | 0xf98000                           |
   |                              |                                    |
   | 1.0                          | 0xf93c00                           |
   |                              |                                    |
   | 1.1                          | 0xfb3ff199999999999a               |
   |                              |                                    |
   | 1.5                          | 0xf93e00                           |
   |                              |                                    |
   | 65504.0                      | 0xf97bff                           |
   |                              |                                    |
   | 100000.0                     | 0xfa47c35000                       |
   |                              |                                    |
   | 3.4028234663852886e+38       | 0xfa7f7fffff                       |
   |                              |                                    |
   | 1.0e+300                     | 0xfb7e37e43c8800759c               |
   |                              |                                    |
   | 5.960464477539063e-8         | 0xf90001                           |
   |                              |                                    |
   | 0.00006103515625             | 0xf90400                           |
   |                              |                                    |
   | -4.0                         | 0xf9c400                           |
   |                              |                                    |
   | -4.1                         | 0xfbc010666666666666               |
   |                              |                                    |
   | Infinity                     | 0xf97c00                           |
   |                              |                                    |
   | NaN                          | 0xf97e00                           |
   |                              |                                    |
   | -Infinity                    | 0xf9fc00                           |
   |                              |                                    |
   | Infinity                     | 0xfa7f800000                       |
   |                              |                                    |
   | NaN                          | 0xfa7fc00000                       |
   |                              |                                    |
   | -Infinity                    | 0xfaff800000                       |
   |                              |                                    |
   | Infinity                     | 0xfb7ff0000000000000               |
   |                              |                                    |
   | NaN                          | 0xfb7ff8000000000000               |
   |                              |                                    |
   | -Infinity                    | 0xfbfff0000000000000               |
   |                              |                                    |
   | false                        | 0xf4                               |
   |                              |                                    |
   | true                         | 0xf5                               |
   |                              |                                    |
   | null                         | 0xf6                               |
   |                              |                                    |
   | undefined                    | 0xf7                               |
   |                              |                                    |
   | simple(16)                   | 0xf0                               |
   |                              |                                    |
   | simple(24)                   | 0xf818                             |
   |                              |                                    |
   | simple(255)                  | 0xf8ff                             |
   |                              |                                    |
   | 0("2013-03-21T20:04:00Z")    | 0xc074323031332d30332d32315432303a |
   |                              | 30343a30305a                       |
   |                              |                                    |
   | 1(1363896240)                | 0xc11a514b67b0                     |
   |                              |                                    |
   | 1(1363896240.5)              | 0xc1fb41d452d9ec200000             |
   |                              |                                    |
   | 23(h'01020304')              | 0xd74401020304                     |
   |                              |                                    |
   | 24(h'6449455446')            | 0xd818456449455446                 |
   |                              |                                    |
   | 32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 |
   |                              | 616d706c652e636f6d                 |
   |                              |                                    |
   | h''                          | 0x40                               |
   |                              |                                    |
   | h'01020304'                  | 0x4401020304                       |
   |                              |                                    |
   | ""                           | 0x60                               |
   |                              |                                    |
   | "a"                          | 0x6161                             |
   |                              |                                    |
   | "IETF"                       | 0x6449455446                       |
   |                              |                                    |
   | "\"\\"                       | 0x62225c                           |
   |                              |                                    |
   | "\u00fc"                     | 0x62c3bc                           |
   |                              |                                    |
   | "\u6c34"                     | 0x63e6b0b4                         |
   |                              |                                    |
   | "\ud800\udd51"               | 0x64f0908591                       |
   |                              |                                    |
   | []                           | 0x80                               |
   |                              |                                    |
   | [1, 2, 3]                    | 0x83010203                         |
   |                              |                                    |
   | [1, [2, 3], [4, 5]]          | 0x8301820203820405                 |
   |                              |                                    |
   | [1, 2, 3, 4, 5, 6, 7, 8, 9,  | 0x98190102030405060708090a0b0c0d0e |
   | 10, 11, 12, 13, 14, 15, 16,  | 0f101112131415161718181819         |
   | 17, 18, 19, 20, 21, 22, 23,  |                                    |
   | 24, 25]                      |                                    |
   |                              |                                    |
   | {}                           | 0xa0                               |
   |                              |                                    |
   | {1: 2, 3: 4}                 | 0xa201020304                       |
   |                              |                                    |
   | {"a": 1, "b": [2, 3]}        | 0xa26161016162820203               |
   |                              |                                    |
   | ["a", {"b": "c"}]            | 0x826161a161626163                 |
   |                              |                                    |
   | {"a": "A", "b": "B", "c":    | 0xa5616161416162614261636143616461 |
   | "C", "d": "D", "e": "E"}     | 4461656145                         |
   |                              |                                    |
   | (_ h'0102', h'030405')       | 0x5f42010243030405ff               |
   |                              |                                    |
   | (_ "strea", "ming")          | 0x7f657374726561646d696e67ff       |
   |                              |                                    |
   | [_ ]                         | 0x9fff                             |
   |                              |                                    |
   | [_ 1, [2, 3], [_ 4, 5]]      | 0x9f018202039f0405ffff             |
   |                              |                                    |
   | [_ 1, [2, 3], [4, 5]]        | 0x9f01820203820405ff               |
   |                              |                                    |
   | [1, [2, 3], [_ 4, 5]]        | 0x83018202039f0405ff               |
   |                              |                                    |
   | [1, [_ 2, 3], [4, 5]]        | 0x83019f0203ff820405               |
   |                              |                                    |
   | [_ 1, 2, 3, 4, 5, 6, 7, 8,   | 0x9f0102030405060708090a0b0c0d0e0f |
   | 9, 10, 11, 12, 13, 14, 15,   | 101112131415161718181819ff         |
   | 16, 17, 18, 19, 20, 21, 22,  |                                    |
   | 23, 24, 25]                  |                                    |
   |                              |                                    |
   | {_ "a": 1, "b": [_ 2, 3]}    | 0xbf61610161629f0203ffff           |
   |                              |                                    |
   | ["a", {_ "b": "c"}]          | 0x826161bf61626163ff               |
   |                              |                                    |
   | {_ "Fun": true, "Amt": -2}   | 0xbf6346756ef563416d7421ff         |
   +------------------------------+------------------------------------+

               Table 4: Examples of Encoded CBOR Data Items

Appendix B.  Jump Table

   For brevity, this jump table does not show initial bytes that are
   reserved for future extension.  It also only shows a selection of the
   initial bytes that can be used for optional features.  (All unsigned
   integers are in network byte order.)

   +------------+------------------------------------------------------+
   | Byte       | Structure/Semantics                                  |
   +------------+------------------------------------------------------+
   | 0x00..0x17 | Integer 0x00..0x17 (0..23)                           |
   |            |                                                      |
   | 0x18       | Unsigned integer (one-byte uint8_t follows)          |
   |            |                                                      |
   | 0x19       | Unsigned integer (two-byte uint16_t follows)         |
   |            |                                                      |
   | 0x1a       | Unsigned integer (four-byte uint32_t follows)        |
   |            |                                                      |
   | 0x1b       | Unsigned integer (eight-byte uint64_t follows)       |
   |            |                                                      |
   | 0x20..0x37 | Negative integer -1-0x00..-1-0x17 (-1..-24)          |
   |            |                                                      |
   | 0x38       | Negative integer -1-n (one-byte uint8_t for n        |
   |            | follows)                                             |
   |            |                                                      |
   | 0x39       | Negative integer -1-n (two-byte uint16_t for n       |
   |            | follows)                                             |
   |            |                                                      |
   | 0x3a       | Negative integer -1-n (four-byte uint32_t for n      |
   |            | follows)                                             |
   |            |                                                      |
   | 0x3b       | Negative integer -1-n (eight-byte uint64_t for n     |
   |            | follows)                                             |
   |            |                                                      |
   | 0x40..0x57 | byte string (0x00..0x17 bytes follow)                |
   |            |                                                      |
   | 0x58       | byte string (one-byte uint8_t for n, and then n      |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x59       | byte string (two-byte uint16_t for n, and then n     |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x5a       | byte string (four-byte uint32_t for n, and then n    |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x5b       | byte string (eight-byte uint64_t for n, and then n   |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x5f       | byte string, byte strings follow, terminated by      |
   |            | "break"                                              |
   |            |                                                      |
   | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow)               |
   |            |                                                      |
   | 0x78       | UTF-8 string (one-byte uint8_t for n, and then n     |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x79       | UTF-8 string (two-byte uint16_t for n, and then n    |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x7a       | UTF-8 string (four-byte uint32_t for n, and then n   |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x7b       | UTF-8 string (eight-byte uint64_t for n, and then n  |
   |            | bytes follow)                                        |
   |            |                                                      |
   | 0x7f       | UTF-8 string, UTF-8 strings follow, terminated by    |
   |            | "break"                                              |
   |            |                                                      |
   | 0x80..0x97 | array (0x00..0x17 data items follow)                 |
   |            |                                                      |
   | 0x98       | array (one-byte uint8_t for n, and then n data items |
   |            | follow)                                              |
   |            |                                                      |
   | 0x99       | array (two-byte uint16_t for n, and then n data      |
   |            | items follow)                                        |
   |            |                                                      |
   | 0x9a       | array (four-byte uint32_t for n, and then n data     |
   |            | items follow)                                        |
   |            |                                                      |
   | 0x9b       | array (eight-byte uint64_t for n, and then n data    |
   |            | items follow)                                        |
   |            |                                                      |
   | 0x9f       | array, data items follow, terminated by "break"      |
   |            |                                                      |
   | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow)          |
   |            |                                                      |
   | 0xb8       | map (one-byte uint8_t for n, and then n pairs of     |
   |            | data items follow)                                   |
   |            |                                                      |
   | 0xb9       | map (two-byte uint16_t for n, and then n pairs of    |
   |            | data items follow)                                   |
   |            |                                                      |
   | 0xba       | map (four-byte uint32_t for n, and then n pairs of   |
   |            | data items follow)                                   |
   |            |                                                      |
   | 0xbb       | map (eight-byte uint64_t for n, and then n pairs of  |
   |            | data items follow)                                   |
   |            |                                                      |
   | 0xbf       | map, pairs of data items follow, terminated by       |
   |            | "break"                                              |
   |            |                                                      |
   | 0xc0       | Text-based date/time (data item follows; see         |
   |            | Section 3.4.1)                                       |
   |            |                                                      |
   | 0xc1       | Epoch-based date/time (data item follows; see        |
   |            | Section 3.4.1)                                       |
   |            |                                                      |
   | 0xc2       | Positive bignum (data item "byte string" follows)    |
   |            |                                                      |
   | 0xc3       | Negative bignum (data item "byte string" follows)    |
   |            |                                                      |
   | 0xc4       | Decimal Fraction (data item "array" follows; see     |
   |            | Section 3.4.3)                                       |
   |            |                                                      |
   | 0xc5       | Bigfloat (data item "array" follows; see             |
   |            | Section 3.4.3)                                       |
   |            |                                                      |
   | 0xc6..0xd4 | (tagged item)                                        |
   |            |                                                      |
   | 0xd5..0xd7 | Expected Conversion (data item follows; see          |
   |            | Section 3.4.4.2)                                     |
   |            |                                                      |
   | 0xd8..0xdb | (more tagged items, 1/2/4/8 bytes and then a data    |
   |            | item follow)                                         |
   |            |                                                      |
   | 0xe0..0xf3 | (simple value)                                       |
   |            |                                                      |
   | 0xf4       | False                                                |
   |            |                                                      |
   | 0xf5       | True                                                 |
   |            |                                                      |
   | 0xf6       | Null                                                 |
   |            |                                                      |
   | 0xf7       | Undefined                                            |
   |            |                                                      |
   | 0xf8       | (simple value, one byte follows)                     |
   |            |                                                      |
   | 0xf9       | Half-Precision Float (two-byte IEEE 754)             |
   |            |                                                      |
   | 0xfa       | Single-Precision Float (four-byte IEEE 754)          |
   |            |                                                      |
   | 0xfb       | Double-Precision Float (eight-byte IEEE 754)         |
   |            |                                                      |
   | 0xff       | "break" stop code                                    |
   +------------+------------------------------------------------------+

                   Table 5: Jump Table for Initial Byte

Appendix C.  Pseudocode

   The well-formedness of a CBOR item can be checked by the pseudocode
   in Figure 1.  The data is well-formed if and only if:

   o  the pseudocode does not "fail";

   o  after execution of the pseudocode, no bytes are left in the input
      (except in streaming applications)

   The pseudocode has the following prerequisites:

   o  take(n) reads n bytes from the input data and returns them as a
      byte string.  If n bytes are no longer available, take(n) fails.

   o  uint() converts a byte string into an unsigned integer by
      interpreting the byte string in network byte order.

   o  Arithmetic works as in C.

   o  All variables are unsigned integers of sufficient range.
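
   The take(n) and uint() prerequisites listed above could be sketched
   in Python roughly as follows.  This is only an illustrative sketch
   and not part of the specification: the class name Input and the
   exception Fail are assumptions introduced here, with Fail standing in
   for the pseudocode's "fail".

```python
class Fail(Exception):
    """Hypothetical stand-in for the pseudocode's fail()."""


class Input:
    """Hypothetical byte source offering take(n) over a bytes object."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        # Fail if fewer than n bytes remain in the input.
        if n > len(self.data) - self.pos:
            raise Fail("input exhausted")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk


def uint(b: bytes) -> int:
    # Interpret the byte string in network byte order (big-endian).
    return int.from_bytes(b, "big")
```

   For the input 0x1903e8, take(1) yields the initial byte 0x19, and
   uint(take(2)) then yields 1000, matching the corresponding row of the
   examples table in Appendix A.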

   well_formed (breakable = false) {
     // process initial bytes
     ib = uint(take(1));
     mt = ib >> 5;
     val = ai = ib & 0x1f;
     switch (ai) {
       case 24: val = uint(take(1)); break;
       case 25: val = uint(take(2)); break;
       case 26: val = uint(take(4)); break;
       case 27: val = uint(take(8)); break;
       case 28: case 29: case 30: fail();
       case 31:
         return well_formed_indefinite(mt, breakable);
     }
     // process content
     switch (mt) {
       // case 0, 1, 7 do not have content; just use val
       case 2: case 3: take(val); break; // bytes/UTF-8
       case 4: for (i = 0; i < val; i++) well_formed(); break;
       case 5: for (i = 0; i < val*2; i++) well_formed(); break;
       case 6: well_formed(); break;     // 1 embedded data item
     }
     return mt;                    // finite data item
   }

   well_formed_indefinite(mt, breakable) {
     switch (mt) {
       case 2: case 3:
         while ((it = well_formed(true)) != -1)
           if (it != mt)           // need finite embedded
             fail();               //    of same type
         break;
       case 4: while (well_formed(true) != -1); break;
       case 5: while (well_formed(true) != -1) well_formed(); break;
       case 7:
         if (breakable)
           return -1;              // signal break out
         else fail();              // no enclosing indefinite
       default: fail();            // wrong mt
     }
     return 0;                     // no break out
   }

              Figure 1: Pseudocode for Well-Formedness Check

   Note that the remaining complexity of a complete CBOR decoder is
   about presenting data that has been parsed to the application in an
   appropriate form.

   Major types 0 and 1 are designed in such a way that they can be
   encoded in C from a signed integer without actually doing an
   if-then-else for positive/negative (Figure 2).
   This uses the fact that
   (-1-n), the transformation for major type 1, is the same as ~n
   (bitwise complement) in C unsigned arithmetic; ~n can then be
   expressed as (-1)^n for the negative case, while 0^n leaves n
   unchanged for non-negative.  The sign of a number can be converted to
   -1 for negative and 0 for non-negative (0 or positive) by
   arithmetic-shifting the number by one bit less than the bit length of
   the number (for example, by 63 for 64-bit numbers).

   void encode_sint(int64_t n) {
     uint64_t ui = n >> 63;    // extend sign to whole length
     unsigned mt = ui & 0x20;  // extract major type
     ui ^= n;                  // complement negatives
     if (ui < 24)
       *p++ = mt + ui;
     else if (ui < 256) {
       *p++ = mt + 24;
       *p++ = ui;
     } else
          ...

            Figure 2: Pseudocode for Encoding a Signed Integer

Appendix D.  Half-Precision

   As half-precision floating-point numbers were only added to IEEE 754
   in 2008, today's programming platforms often still only have limited
   support for them.  It is very easy to include at least decoding
   support for them even on platforms without such support.  An example
   of a small decoder for half-precision floating-point numbers in the C
   language is shown in Figure 3.  A similar program for Python is in
   Figure 4; this code assumes that the 2-byte value has already been
   decoded as an (unsigned short) integer in network byte order (as
   would be done by the pseudocode in Appendix C).

   #include <math.h>

   double decode_half(unsigned char *halfp) {
     int half = (halfp[0] << 8) + halfp[1];
     int exp = (half >> 10) & 0x1f;
     int mant = half & 0x3ff;
     double val;
     if (exp == 0) val = ldexp(mant, -24);
     else if (exp != 31) val = ldexp(mant + 1024, exp - 25);
     else val = mant == 0 ? INFINITY : NAN;
     return half & 0x8000 ? -val : val;
   }

               Figure 3: C Code for a Half-Precision Decoder

   import struct
   from math import ldexp

   def decode_single(single):
       return struct.unpack("!f", struct.pack("!I", single))[0]

   def decode_half(half):
       valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16
       if ((half & 0x7c00) != 0x7c00):
           return ldexp(decode_single(valu), 112)
       return decode_single(valu | 0x7f800000)

            Figure 4: Python Code for a Half-Precision Decoder

Appendix E.  Comparison of Other Binary Formats to CBOR's Design
             Objectives

   The proposal for CBOR follows a history of binary formats that is as
   long as the history of computers themselves.  Different formats have
   had different objectives.  In most cases, the objectives of the
   format were never stated, although they can sometimes be implied by
   the context where the format was first used.  Some formats were meant
   to be universally usable, although history has proven that no binary
   format meets the needs of all protocols and applications.

   CBOR differs from many of these formats in that it starts with a set
   of objectives and attempts to meet just those.  This section
   compares a few of the dozens of formats with CBOR's objectives in
   order to help the reader decide if they want to use CBOR or a
   different format for a particular protocol or application.

   Note that the discussion here is not meant to be a criticism of any
   format: to the best of our knowledge, no format before CBOR was meant
   to cover CBOR's objectives in the priority we have assigned them.  A
   brief recap of the objectives from Section 1.1 is:

   1.  unambiguous encoding of most common data formats from Internet
       standards

   2.  code compactness for encoder or decoder

   3.  no schema description needed

   4.  reasonably compact serialization

   5.  applicability to constrained and unconstrained applications

   6.  good JSON conversion

   7.  extensibility

E.1.  ASN.1 DER, BER, and PER

   [ASN.1] has many serializations.  In the IETF, DER and BER are the
   most common.  The serialized output is not particularly compact for
   many items, and the code needed to decode numeric items can be
   complex on a constrained device.

   Few (if any) IETF protocols have adopted one of the several variants
   of Packed Encoding Rules (PER).  There could be many reasons for
   this, but one that is commonly stated is that PER makes use of the
   schema even for parsing the surface structure of the data stream,
   requiring significant tool support.  There are different versions of
   the ASN.1 schema language in use, which has also hampered adoption.

E.2.  MessagePack

   [MessagePack] is a concise, widely implemented counted binary
   serialization format, similar in many properties to CBOR, although
   somewhat less regular.  While the data model can be used to represent
   JSON data, MessagePack has also been used in many remote procedure
   call (RPC) applications and for long-term storage of data.

   MessagePack has been essentially stable since it was first published
   around 2011; it has not yet had a transition.  The evolution of
   MessagePack is impeded by an imperative to maintain complete
   backwards compatibility with existing stored data, while only a few
   bytecodes are still available for extension.  Repeated requests over
   the years from the MessagePack user community to separate out binary
   and text strings in the encoding recently have led to an extension
   proposal that would leave MessagePack's "raw" data ambiguous between
   its usages for binary and text data.  The extension mechanism for
   MessagePack remains unclear.

E.3.  BSON

   [BSON] is a data format that was developed for the storage of
   JSON-like maps (JSON objects) in the MongoDB database.
   Its major distinguishing feature is the capability for in-place
   update, forgoing a compact representation.  BSON uses a counted
   representation except for map keys, which are null-byte terminated.
   While BSON can be used for the representation of JSON-like objects on
   the wire, its specification is dominated by the requirements of the
   database application and has become somewhat baroque.  The status of
   how BSON extensions will be implemented remains unclear.

E.4.  UBJSON

   [UBJSON] has a design goal to make JSON faster and somewhat smaller,
   using a binary format that is limited to exactly the data model JSON
   uses.  Thus, there is expressly no intention to support, for example,
   binary data; however, there is a "high-precision number", expressed
   as a character string in JSON syntax.  UBJSON is not optimized for
   code compactness, and its type byte coding is optimized for human
   recognition and not for compact representation of native types such
   as small integers.  Although UBJSON is mostly counted, it provides a
   reserved "unknown-length" value to support streaming of arrays and
   maps (JSON objects).  Within these containers, UBJSON also has a
   "Noop" type for padding.

E.5.  MSDTP: RFC 713

   Message Services Data Transmission Protocol (MSDTP) is a very early
   example of a compact message format; it is described in [RFC0713],
   written in 1976.  It is included here for its historical value, not
   because it was ever widely used.

E.6.  Conciseness on the Wire

   While CBOR's design objective of code compactness for encoders and
   decoders is a higher priority than its objective of conciseness on
   the wire, many people focus on the wire size.
   Table 6 shows some
   encoding examples for the simple nested array [1, [2, 3]]; where some
   form of indefinite-length encoding is supported by the encoding,
   [_ 1, [2, 3]] (indefinite length on the outer array) is also shown.

   +-------------+--------------------------+--------------------------+
   | Format      | [1, [2, 3]]              | [_ 1, [2, 3]]            |
   +-------------+--------------------------+--------------------------+
   | RFC 713     | c2 05 81 c2 02 82 83     |                          |
   |             |                          |                          |
   | ASN.1 BER   | 30 0b 02 01 01 30 06 02  | 30 80 02 01 01 30 06 02  |
   |             | 01 02 02 01 03           | 01 02 02 01 03 00 00     |
   |             |                          |                          |
   | MessagePack | 92 01 92 02 03           |                          |
   |             |                          |                          |
   | BSON        | 22 00 00 00 10 30 00 01  |                          |
   |             | 00 00 00 04 31 00 13 00  |                          |
   |             | 00 00 10 30 00 02 00 00  |                          |
   |             | 00 10 31 00 03 00 00 00  |                          |
   |             | 00 00                    |                          |
   |             |                          |                          |
   | UBJSON      | 61 02 42 01 61 02 42 02  | 61 ff 42 01 61 02 42 02  |
   |             | 42 03                    | 42 03 45                 |
   |             |                          |                          |
   | CBOR        | 82 01 82 02 03           | 9f 01 82 02 03 ff        |
   +-------------+--------------------------+--------------------------+

          Table 6: Examples for Different Levels of Conciseness

Appendix F.  Changes from RFC 7049

   The following is a list of known changes from RFC 7049.  This list is
   non-authoritative.  It is meant to help reviewers see the significant
   differences.

   o  Updated reference for [RFC4627] to [RFC8259] in many places

   o  Updated reference for [CNN-TERMS] to [RFC7228]

   o  Added a comment to the last example in Section 2.2.1 (added
      "Second value")

   o  Fixed a bug in the example in Section 2.4.2 ("29" -> "49")

   o  Fixed a bug in the last paragraph of Section 3.6 ("0b000_11101" ->
      "0b000_11001")

Authors' Addresses

   Carsten Bormann
   Universitaet Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany

   Phone: +49-421-218-63921
   EMail: cabo@tzi.org


   Paul Hoffman
   ICANN

   EMail: paul.hoffman@icann.org