idnits 2.17.1 draft-ietf-cbor-7049bis-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (18 December 2019) is 1592 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '2' on line 2566 -- Looks like a reference, but probably isn't: '3' on line 2566 -- Looks like a reference, but probably isn't: '4' on line 2564 -- Looks like a reference, but probably isn't: '5' on line 2564 -- Looks like a reference, but probably isn't: '100' on line 1406 == Missing Reference: '-1' is mentioned on line 1402, but not defined -- Looks like a reference, but probably isn't: '1' on line 2852 == Missing Reference: 'RFCthis' is mentioned on line 2133, but not defined == Missing Reference: 'TM' is mentioned on line 2385, but not defined -- Looks like a reference, but probably isn't: '0' on line 2868 == Missing Reference: 'RFC4627' is mentioned on line 3008, but not defined ** Obsolete undefined reference: RFC 4627 (Obsoleted by RFC 7158, RFC 7159) == Missing Reference: 'CNN-TERMS' is mentioned on line 3010, but not defined -- Possible downref: Non-RFC (?) normative reference: ref. 'ECMA262' -- Possible downref: Non-RFC (?) normative reference: ref. 'IEEE754' -- Obsolete informational reference (is this intentional?): RFC 7049 (Obsoleted by RFC 8949) Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group C. Bormann 3 Internet-Draft Universitaet Bremen TZI 4 Obsoletes: 7049 (if approved) P. 
Hoffman 5 Intended status: Standards Track ICANN 6 Expires: 20 June 2020 18 December 2019 8 Concise Binary Object Representation (CBOR) 9 draft-ietf-cbor-7049bis-12 11 Abstract 13 The Concise Binary Object Representation (CBOR) is a data format 14 whose design goals include the possibility of extremely small code 15 size, fairly small message size, and extensibility without the need 16 for version negotiation. These design goals make it different from 17 earlier binary serializations such as ASN.1 and MessagePack. 19 This document is a revised edition of RFC 7049, with editorial 20 improvements, added detail, and fixed errata. This revision formally 21 obsoletes RFC 7049, while keeping full compatibility of the 22 interchange format from RFC 7049. It does not create a new version 23 of the format. 25 Contributing 27 This document is being worked on in the CBOR Working Group. Please 28 contribute on the mailing list there, or in the GitHub repository for 29 this draft: https://github.com/cbor-wg/CBORbis 31 The charter for the CBOR Working Group says that the WG will update 32 RFC 7049 to fix verified errata. Security issues and clarifications 33 may be addressed, but changes to this document will ensure backward 34 compatibility for popular deployed codebases. This document will be 35 targeted at becoming an Internet Standard. 37 Status of This Memo 39 This Internet-Draft is submitted in full conformance with the 40 provisions of BCP 78 and BCP 79. 42 Internet-Drafts are working documents of the Internet Engineering 43 Task Force (IETF). Note that other groups may also distribute 44 working documents as Internet-Drafts. The list of current Internet- 45 Drafts is at https://datatracker.ietf.org/drafts/current/. 47 Internet-Drafts are draft documents valid for a maximum of six months 48 and may be updated, replaced, or obsoleted by other documents at any 49 time. 
It is inappropriate to use Internet-Drafts as reference 50 material or to cite them other than as "work in progress." 52 This Internet-Draft will expire on 20 June 2020. 54 Copyright Notice 56 Copyright (c) 2019 IETF Trust and the persons identified as the 57 document authors. All rights reserved. 59 This document is subject to BCP 78 and the IETF Trust's Legal 60 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 61 license-info) in effect on the date of publication of this document. 62 Please review these documents carefully, as they describe your rights 63 and restrictions with respect to this document. Code Components 64 extracted from this document must include Simplified BSD License text 65 as described in Section 4.e of the Trust Legal Provisions and are 66 provided without warranty as described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 71 1.1. Objectives . . . . . . . . . . . . . . . . . . . . . . . 4 72 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 6 73 2. CBOR Data Models . . . . . . . . . . . . . . . . . . . . . . 7 74 2.1. Extended Generic Data Models . . . . . . . . . . . . . . 8 75 2.2. Specific Data Models . . . . . . . . . . . . . . . . . . 9 76 3. Specification of the CBOR Encoding . . . . . . . . . . . . . 9 77 3.1. Major Types . . . . . . . . . . . . . . . . . . . . . . . 11 78 3.2. Indefinite Lengths for Some Major Types . . . . . . . . . 13 79 3.2.1. The "break" Stop Code . . . . . . . . . . . . . . . . 13 80 3.2.2. Indefinite-Length Arrays and Maps . . . . . . . . . . 14 81 3.2.3. Indefinite-Length Byte Strings and Text Strings . . . 16 82 3.3. Floating-Point Numbers and Values with No Content . . . . 16 83 3.4. Tagging of Items . . . . . . . . . . . . . . . . . . . . 18 84 3.4.1. Standard Date/Time String . . . . . . . . . . . . . . 20 85 3.4.2. Epoch-based Date/Time . . . . . . . . . . . . . . . . 20 86 3.4.3. Bignums . . . . . . . . 
. . . . . . . . . . . . . . . 21 87 3.4.4. Decimal Fractions and Bigfloats . . . . . . . . . . . 22 88 3.4.5. Content Hints . . . . . . . . . . . . . . . . . . . . 23 89 3.4.5.1. Encoded CBOR Data Item . . . . . . . . . . . . . 23 90 3.4.5.2. Expected Later Encoding for CBOR-to-JSON 91 Converters . . . . . . . . . . . . . . . . . . . . 24 92 3.4.5.3. Encoded Text . . . . . . . . . . . . . . . . . . 24 93 3.4.6. Self-Described CBOR . . . . . . . . . . . . . . . . . 25 94 4. Serialization Considerations . . . . . . . . . . . . . . . . 26 95 4.1. Preferred Serialization . . . . . . . . . . . . . . . . . 26 96 4.2. Deterministically Encoded CBOR . . . . . . . . . . . . . 27 97 4.2.1. Core Deterministic Encoding Requirements . . . . . . 27 98 4.2.2. Additional Deterministic Encoding Considerations . . 28 99 4.2.3. Length-first map key ordering . . . . . . . . . . . . 30 100 5. Creating CBOR-Based Protocols . . . . . . . . . . . . . . . . 31 101 5.1. CBOR in Streaming Applications . . . . . . . . . . . . . 31 102 5.2. Generic Encoders and Decoders . . . . . . . . . . . . . . 32 103 5.3. Validity of Items . . . . . . . . . . . . . . . . . . . . 32 104 5.3.1. Basic validity . . . . . . . . . . . . . . . . . . . 33 105 5.3.2. Tag validity . . . . . . . . . . . . . . . . . . . . 33 106 5.4. Validity and Evolution . . . . . . . . . . . . . . . . . 34 107 5.5. Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 35 108 5.6. Specifying Keys for Maps . . . . . . . . . . . . . . . . 35 109 5.6.1. Equivalence of Keys . . . . . . . . . . . . . . . . . 36 110 5.7. Undefined Values . . . . . . . . . . . . . . . . . . . . 37 111 6. Converting Data between CBOR and JSON . . . . . . . . . . . . 38 112 6.1. Converting from CBOR to JSON . . . . . . . . . . . . . . 38 113 6.2. Converting from JSON to CBOR . . . . . . . . . . . . . . 39 114 7. Future Evolution of CBOR . . . . . . . . . . . . . . . . . . 40 115 7.1. Extension Points . . . . . . . . . . . . . . . . . . . . 41 116 7.2. 
Curating the Additional Information Space . . . . . . . . 41 117 8. Diagnostic Notation . . . . . . . . . . . . . . . . . . . . . 42 118 8.1. Encoding Indicators . . . . . . . . . . . . . . . . . . . 43 119 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 44 120 9.1. Simple Values Registry . . . . . . . . . . . . . . . . . 44 121 9.2. Tags Registry . . . . . . . . . . . . . . . . . . . . . . 44 122 9.3. Media Type ("MIME Type") . . . . . . . . . . . . . . . . 45 123 9.4. CoAP Content-Format . . . . . . . . . . . . . . . . . . . 45 124 9.5. The +cbor Structured Syntax Suffix Registration . . . . . 46 125 10. Security Considerations . . . . . . . . . . . . . . . . . . . 47 126 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 48 127 11.1. Normative References . . . . . . . . . . . . . . . . . . 48 128 11.2. Informative References . . . . . . . . . . . . . . . . . 50 129 Appendix A. Examples . . . . . . . . . . . . . . . . . . . . . . 51 130 Appendix B. Jump Table . . . . . . . . . . . . . . . . . . . . . 55 131 Appendix C. Pseudocode . . . . . . . . . . . . . . . . . . . . . 58 132 Appendix D. Half-Precision . . . . . . . . . . . . . . . . . . . 61 133 Appendix E. Comparison of Other Binary Formats to CBOR's Design 134 Objectives . . . . . . . . . . . . . . . . . . . . . . . 62 135 E.1. ASN.1 DER, BER, and PER . . . . . . . . . . . . . . . . . 63 136 E.2. MessagePack . . . . . . . . . . . . . . . . . . . . . . . 63 137 E.3. BSON . . . . . . . . . . . . . . . . . . . . . . . . . . 64 138 E.4. MSDTP: RFC 713 . . . . . . . . . . . . . . . . . . . . . 64 139 E.5. Conciseness on the Wire . . . . . . . . . . . . . . . . . 64 140 Appendix F. Changes from RFC 7049 . . . . . . . . . . . . . . . 65 141 Appendix G. Well-formedness errors and examples . . . . . . . . 65 142 G.1. Examples for CBOR data items that are not well-formed . . 66 143 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 68 144 Authors' Addresses . . . . . . . . 
. . . . . . . . . . . . . . . 69 146 1. Introduction 148 There are hundreds of standardized formats for binary representation 149 of structured data (also known as binary serialization formats). Of 150 those, some are for specific domains of information, while others are 151 generalized for arbitrary data. In the IETF, probably the best-known 152 formats in the latter category are ASN.1's BER and DER [ASN.1]. 154 The format defined here follows some specific design goals that are 155 not well met by current formats. The underlying data model is an 156 extended version of the JSON data model [RFC8259]. It is important 157 to note that this is not a proposal that the grammar in RFC 8259 be 158 extended in general, since doing so would cause a significant 159 backwards incompatibility with already deployed JSON documents. 160 Instead, this document simply defines its own data model that starts 161 from JSON. 163 Appendix E lists some existing binary formats and discusses how well 164 they do or do not fit the design objectives of the Concise Binary 165 Object Representation (CBOR). 167 This document is a revised edition of [RFC7049], with editorial 168 improvements, added detail, and fixed errata. This revision formally 169 obsoletes RFC 7049, while keeping full compatibility of the 170 interchange format from RFC 7049. It does not create a new version 171 of the format. 173 1.1. Objectives 175 The objectives of CBOR, roughly in decreasing order of importance, 176 are: 178 1. The representation must be able to unambiguously encode most 179 common data formats used in Internet standards. 181 * It must represent a reasonable set of basic data types and 182 structures using binary encoding. "Reasonable" here is 183 largely influenced by the capabilities of JSON, with the major 184 addition of binary byte strings. The structures supported are 185 limited to arrays and trees; loops and lattice-style graphs 186 are not supported. 
188 * There is no requirement that all data formats be uniquely 189 encoded; that is, it is acceptable that the number "7" might 190 be encoded in multiple different ways. 192 2. The code for an encoder or decoder must be able to be compact in 193 order to support systems with very limited memory, processor 194 power, and instruction sets. 196 * An encoder and a decoder need to be implementable in a very 197 small amount of code (for example, in class 1 constrained 198 nodes as defined in [RFC7228]). 200 * The format should use contemporary machine representations of 201 data (for example, not requiring binary-to-decimal 202 conversion). 204 3. Data must be able to be decoded without a schema description. 206 * Similar to JSON, encoded data should be self-describing so 207 that a generic decoder can be written. 209 4. The serialization must be reasonably compact, but data 210 compactness is secondary to code compactness for the encoder and 211 decoder. 213 * "Reasonable" here is bounded by JSON as an upper bound in 214 size, and by implementation complexity maintaining a lower 215 bound. Using either general compression schemes or extensive 216 bit-fiddling violates the complexity goals. 218 5. The format must be applicable to both constrained nodes and high- 219 volume applications. 221 * This means it must be reasonably frugal in CPU usage for both 222 encoding and decoding. This is relevant both for constrained 223 nodes and for potential usage in applications with a very high 224 volume of data. 226 6. The format must support all JSON data types for conversion to and 227 from JSON. 229 * It must support a reasonable level of conversion as long as 230 the data represented is within the capabilities of JSON. It 231 must be possible to define a unidirectional mapping towards 232 JSON for all types of data. 234 7. The format must be extensible, and the extended data must be 235 decodable by earlier decoders. 237 * The format is designed for decades of use. 
239 * The format must support a form of extensibility that allows 240 fallback so that a decoder that does not understand an 241 extension can still decode the message. 243 * The format must be able to be extended in the future by later 244 IETF standards. 246 1.2. Terminology 248 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 249 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 250 "OPTIONAL" in this document are to be interpreted as described in 251 BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all 252 capitals, as shown here. 254 The term "byte" is used in its now-customary sense as a synonym for 255 "octet". All multi-byte values are encoded in network byte order 256 (that is, most significant byte first, also known as "big-endian"). 258 This specification makes use of the following terminology: 260 Data item: A single piece of CBOR data. The structure of a data 261 item may contain zero, one, or more nested data items. The term 262 is used both for the data item in representation format and for 263 the abstract idea that can be derived from that by a decoder; the 264 former can be addressed specifically by using "encoded data item". 266 Decoder: A process that decodes a well-formed encoded CBOR data item 267 and makes it available to an application. Formally speaking, a 268 decoder contains a parser to break up the input using the syntax 269 rules of CBOR, as well as a semantic processor to prepare the data 270 in a form suitable to the application. 272 Encoder: A process that generates the (well-formed) representation 273 format of a CBOR data item from application information. 275 Data Stream: A sequence of zero or more data items, not further 276 assembled into a larger containing data item. The independent 277 data items that make up a data stream are sometimes also referred 278 to as "top-level data items". 280 Well-formed: A data item that follows the syntactic structure of 281 CBOR. 
A well-formed data item uses the initial bytes and the byte 282 strings and/or data items that are implied by their values as 283 defined in CBOR and does not include following extraneous data. 284 CBOR decoders by definition only return contents from well-formed 285 data items. 287 Valid: A data item that is well-formed and also follows the semantic 288 restrictions that apply to CBOR data items. 290 Expected: Besides its normal English meaning, the term "expected" is 291 used to describe requirements beyond CBOR validity that an 292 application has on its input data. Well-formed (processable at 293 all), valid (checked by a validity-checking generic decoder), and 294 expected (checked by the application) form a hierarchy of layers 295 of acceptability. 297 Stream decoder: A process that decodes a data stream and makes each 298 of the data items in the sequence available to an application as 299 they are received. 301 Where bit arithmetic or data types are explained, this document uses 302 the notation familiar from the programming language C, except that 303 "**" denotes exponentiation. Similar to the "0x" notation for 304 hexadecimal numbers, numbers in binary notation are prefixed with 305 "0b". Underscores can be added to a number solely for readability, 306 so 0b00100001 (0x21) might be written 0b001_00001 to emphasize the 307 desired interpretation of the bits in the byte; in this case, it is 308 split into three bits and five bits. Encoded CBOR data items are 309 sometimes given in the "0x" or "0b" notation; these values are first 310 interpreted as numbers as in C and are then interpreted as byte 311 strings in network byte order, including any leading zero bytes 312 expressed in the notation. 314 2. CBOR Data Models 316 CBOR is explicit about its generic data model, which defines the set 317 of all data items that can be represented in CBOR. Its basic generic 318 data model is extensible by the registration of simple type values 319 and tags. 
Applications can then subset the resulting extended 320 generic data model to build their specific data models. 322 Within environments that can represent the data items in the generic 323 data model, generic CBOR encoders and decoders can be implemented 324 (which usually involves defining additional implementation data types 325 for those data items that do not already have a natural 326 representation in the environment). The ability to provide generic 327 encoders and decoders is an explicit design goal of CBOR; however 328 many applications will provide their own application-specific 329 encoders and/or decoders. 331 In the basic (un-extended) generic data model, a data item is one of: 333 * an integer in the range -2**64..2**64-1 inclusive 334 * a simple value, identified by a number between 0 and 255, but 335 distinct from that number 337 * a floating-point value, distinct from an integer, out of the set 338 representable by IEEE 754 binary64 (including non-finites) 339 [IEEE754] 341 * a sequence of zero or more bytes ("byte string") 343 * a sequence of zero or more Unicode code points ("text string") 345 * a sequence of zero or more data items ("array") 347 * a mapping (mathematical function) from zero or more data items 348 ("keys") each to a data item ("values"), ("map") 350 * a tagged data item ("tag"), comprising a tag number (an integer in 351 the range 0..2**64-1) and a tagged value (a data item) 353 Note that integer and floating-point values are distinct in this 354 model, even if they have the same numeric value. 356 Also note that serialization variants, such as the number of bytes of 357 the encoded floating value, or the choice of one of the ways in which 358 an integer, the length of a text or byte string, the number of 359 elements in an array or pairs in a map, or a tag number, 360 (collectively "the argument", see Section 3) can be encoded, are not 361 visible at the generic data model level. 363 2.1. 
Extended Generic Data Models 365 This basic generic data model comes pre-extended by the registration 366 of a number of simple values and tag numbers right in this document, 367 such as: 369 * "false", "true", "null", and "undefined" (simple values identified 370 by 20..23) 372 * integer and floating-point values with a larger range and 373 precision than the above (tag numbers 2 to 5) 375 * application data types such as a point in time or an RFC 3339 376 date/time string (tag numbers 1, 0) 378 Further elements of the extended generic data model can be (and have 379 been) defined via the IANA registries created for CBOR. Even if such 380 an extension is unknown to a generic encoder or decoder, data items 381 using that extension can be passed to or from the application by 382 representing them at the interface to the application within the 383 basic generic data model, i.e., as generic values of a simple type or 384 generic tags. 386 In other words, the basic generic data model is stable as defined in 387 this document, while the extended generic data model expands by the 388 registration of new simple values or tag numbers, but never shrinks. 390 While there is a strong expectation that generic encoders and 391 decoders can represent "false", "true", and "null" ("undefined" is 392 intentionally omitted) in the form appropriate for their programming 393 environment, implementation of the data model extensions created by 394 tags is truly optional and a matter of implementation quality. 396 2.2. Specific Data Models 398 The specific data model for a CBOR-based protocol usually subsets the 399 extended generic data model and assigns application semantics to the 400 data items within this subset and its components. 
When documenting 401 such specific data models, where it is desired to specify the types 402 of data items, it is preferred to identify the types by the names 403 they have in the generic data model ("negative integer", "array") 404 instead of by referring to aspects of their CBOR representation 405 ("major type 1", "major type 4"). 407 Specific data models can also specify what values (including values 408 of different types) are equivalent for the purposes of map keys and 409 encoder freedom. For example, in the generic data model, a valid map 410 MAY have both "0" and "0.0" as keys, and an encoder MUST NOT encode 411 "0.0" as an integer (major type 0, Section 3.1). However, if a 412 specific data model declares that floating-point and integer 413 representations of integral values are equivalent, using both map 414 keys "0" and "0.0" in a single map would be considered duplicates, 415 even while encoded as different major types, and so invalid; and an 416 encoder could encode integral-valued floats as integers or vice 417 versa, perhaps to save encoded bytes. 419 3. Specification of the CBOR Encoding 421 A CBOR data item (Section 2) is encoded to or decoded from a byte 422 string carrying a well-formed encoded data item as described in this 423 section. The encoding is summarized in Table 6, indexed by the 424 initial byte. An encoder MUST produce only well-formed encoded data 425 items. A decoder MUST NOT return a decoded data item when it 426 encounters input that is not a well-formed encoded CBOR data item 427 (this does not detract from the usefulness of diagnostic and recovery 428 tools that might make available some information from a damaged 429 encoded CBOR data item). 431 The initial byte of each encoded data item contains both information 432 about the major type (the high-order 3 bits, described in 433 Section 3.1) and additional information (the low-order 5 bits). 
With 434 a few exceptions, the additional information's value describes how to 435 load an unsigned integer "argument": 437 Less than 24: The argument's value is the value of the additional 438 information. 440 24, 25, 26, or 27: The argument's value is held in the following 1, 441 2, 4, or 8 bytes, respectively, in network byte order. For major 442 type 7 and additional information value 25, 26, 27, these bytes 443 are not used as an integer argument, but as a floating-point value 444 (see Section 3.3). 446 28, 29, 30: These values are reserved for future additions to the 447 CBOR format. In the present version of CBOR, the encoded item is 448 not well-formed. 450 31: No argument value is derived. If the major type is 0, 1, or 6, 451 the encoded item is not well-formed. For major types 2 to 5, the 452 item's length is indefinite, and for major type 7, the byte does 453 not constitute a data item at all but terminates an indefinite- 454 length item; both are described in Section 3.2. 456 The initial byte and any additional bytes consumed to construct the 457 argument are collectively referred to as the "head" of the data item. 459 The meaning of this argument depends on the major type. For example, 460 in major type 0, the argument is the value of the data item itself 461 (and in major type 1 the value of the data item is computed from the 462 argument); in major types 2 and 3 it gives the length of the string 463 data in bytes that follows; and in major types 4 and 5 it is used to 464 determine the number of data items enclosed. 466 If the encoded sequence of bytes ends before the end of a data item, 467 that item is not well-formed. If the encoded sequence of bytes still 468 has bytes remaining after the outermost encoded item is decoded, that 469 encoding is not a single well-formed CBOR item; depending on the 470 application, the decoder may either treat the encoding as not well- 471 formed or just identify the start of the remaining bytes to the 472 application. 
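The head-decoding rules above can be sketched in Python as follows. This is a minimal illustration only, not the pseudocode of Appendix C: the names `decode_head`, `Head`, and `NotWellFormed` are hypothetical, and the sketch stops at the head, leaving any content that follows (string bytes, nested items) to the caller.

```python
from typing import NamedTuple, Optional


class NotWellFormed(ValueError):
    """Raised when the input cannot start a well-formed encoded data item."""


class Head(NamedTuple):
    major_type: int          # high-order 3 bits of the initial byte
    additional_info: int     # low-order 5 bits of the initial byte
    argument: Optional[int]  # None when the additional information is 31
    size: int                # number of bytes occupied by the head


def decode_head(data: bytes, offset: int = 0) -> Head:
    """Decode the head (initial byte plus argument bytes) found at `offset`."""
    if offset >= len(data):
        raise NotWellFormed("input ends before the initial byte")
    initial = data[offset]
    major_type = initial >> 5
    ai = initial & 0x1F

    if ai < 24:
        # The argument is the additional information itself.
        return Head(major_type, ai, ai, 1)
    if ai <= 27:
        # The argument is held in the following 1, 2, 4, or 8 bytes,
        # in network byte order. For major type 7 with ai 25..27 the
        # caller must reinterpret these bits as a floating-point value.
        n = 1 << (ai - 24)
        raw = data[offset + 1 : offset + 1 + n]
        if len(raw) < n:
            raise NotWellFormed("input ends inside the head")
        return Head(major_type, ai, int.from_bytes(raw, "big"), 1 + n)
    if ai < 31:
        raise NotWellFormed(f"additional information {ai} is reserved")
    # ai == 31: indefinite length (major types 2 to 5) or "break" (major type 7).
    if major_type in (0, 1, 6):
        raise NotWellFormed("additional information 31 is not allowed here")
    return Head(major_type, ai, None, 1)
```

For instance, `decode_head(bytes.fromhex("1901f4"))` yields major type 0 with argument 500 and a head size of 3 bytes, while for `bytes.fromhex("3901f3")` (major type 1, argument 499) the value of the data item is `-1 - argument`, that is, -500.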
474 A CBOR decoder implementation can be based on a jump table with all 475 256 defined values for the initial byte (Table 6). A decoder in a 476 constrained implementation can instead use the structure of the 477 initial byte and following bytes for more compact code (see 478 Appendix C for a rough impression of how this could look). 480 3.1. Major Types 482 The following lists the major types and the additional information 483 and other bytes associated with the type. 485 Major type 0: an integer in the range 0..2**64-1 inclusive. The 486 value of the encoded item is the argument itself. For example, 487 the integer 10 is denoted as the one byte 0b000_01010 (major type 488 0, additional information 10). The integer 500 would be 489 0b000_11001 (major type 0, additional information 25) followed by 490 the two bytes 0x01f4, which is 500 in decimal. 492 Major type 1: a negative integer in the range -2**64..-1 inclusive. 493 The value of the item is -1 minus the argument. For example, the 494 integer -500 would be 0b001_11001 (major type 1, additional 495 information 25) followed by the two bytes 0x01f3, which is 499 in 496 decimal. 498 Major type 2: a byte string. The number of bytes in the string is 499 equal to the argument. For example, a byte string whose length is 500 5 would have an initial byte of 0b010_00101 (major type 2, 501 additional information 5 for the length), followed by 5 bytes of 502 binary content. A byte string whose length is 500 would have 3 503 initial bytes of 0b010_11001 (major type 2, additional information 504 25 to indicate a two-byte length) followed by the two bytes 0x01f4 505 for a length of 500, followed by 500 bytes of binary content. 507 Major type 3: a text string (Section 2), encoded as UTF-8 508 ([RFC3629]). The number of bytes in the string is equal to the 509 argument. A string containing an invalid UTF-8 sequence is well- 510 formed but invalid. 
This type is provided for systems that need 511 to interpret or display human-readable text, and allows the 512 differentiation between unstructured bytes and text that has a 513 specified repertoire and encoding. In contrast to formats such as 514 JSON, the Unicode characters in this type are never escaped. 515 Thus, a newline character (U+000A) is always represented in a 516 string as the byte 0x0a, and never as the bytes 0x5c6e (the 517 characters "\" and "n") or as 0x5c7530303061 (the characters "\", 518 "u", "0", "0", "0", and "a"). 520 Major type 4: an array of data items. Arrays are also called lists, 521 sequences, or tuples. The argument is the number of data items in 522 the array. Items in an array do not need to all be of the same 523 type. For example, an array that contains 10 items of any type 524 would have an initial byte of 0b100_01010 (major type of 4, 525 additional information of 10 for the length) followed by the 10 526 remaining items. 528 Major type 5: a map of pairs of data items. Maps are also called 529 tables, dictionaries, hashes, or objects (in JSON). A map is 530 comprised of pairs of data items, each pair consisting of a key 531 that is immediately followed by a value. The argument is the 532 number of _pairs_ of data items in the map. For example, a map 533 that contains 9 pairs would have an initial byte of 0b101_01001 534 (major type of 5, additional information of 9 for the number of 535 pairs) followed by the 18 remaining items. The first item is the 536 first key, the second item is the first value, the third item is 537 the second key, and so on. Because items in a map come in pairs, 538 their total number is always even: A map that contains an odd 539 number of items (no value data present after the last key data 540 item) is not well-formed. A map that has duplicate keys may be 541 well-formed, but it is not valid, and thus it causes indeterminate 542 decoding; see also Section 5.6. 
544 Major type 6: a tagged data item ("tag") whose tag number is the 545 argument and whose enclosed data item ("tag content") is the 546 single encoded data item that follows the head. See Section 3.4. 548 Major type 7: floating-point numbers and simple values, as well as 549 the "break" stop code. See Section 3.3. 551 These eight major types lead to a simple table showing which of the 552 256 possible values for the initial byte of a data item are used 553 (Table 6). 555 In major types 6 and 7, many of the possible values are reserved for 556 future specification. See Section 9 for more information on these 557 values. 559 Table 1 summarizes the major types defined by CBOR, ignoring the next 560 section for now. The number N in this table stands for the argument, 561 mt for the major type.
563 +----+-----------------------+---------------------------------+
564 | mt | Meaning               | Content                         |
565 +====+=======================+=================================+
566 | 0  | unsigned integer N    | -                               |
567 +----+-----------------------+---------------------------------+
568 | 1  | negative integer -1-N | -                               |
569 +----+-----------------------+---------------------------------+
570 | 2  | byte string           | N bytes                         |
571 +----+-----------------------+---------------------------------+
572 | 3  | text string           | N bytes (UTF-8 text)            |
573 +----+-----------------------+---------------------------------+
574 | 4  | array                 | N data items (elements)         |
575 +----+-----------------------+---------------------------------+
576 | 5  | map                   | 2N data items (key/value pairs) |
577 +----+-----------------------+---------------------------------+
578 | 6  | tag of number N       | 1 data item                     |
579 +----+-----------------------+---------------------------------+
580 | 7  | simple/float          | -                               |
581 +----+-----------------------+---------------------------------+
583 Table 1: Overview of CBOR major types (definite length 584 encoded)
586 3.2. 
Indefinite Lengths for Some Major Types 588 Four CBOR items (arrays, maps, byte strings, and text strings) can be 589 encoded with an indefinite length using additional information value 590 31. This is useful if the encoding of the item needs to begin before 591 the number of items inside the array or map, or the total length of 592 the string, is known. (The application of this is often referred to 593 as "streaming" within a data item.) 595 Indefinite-length arrays and maps are dealt with differently than 596 indefinite-length byte strings and text strings. 598 3.2.1. The "break" Stop Code 600 The "break" stop code is encoded with major type 7 and additional 601 information value 31 (0b111_11111). It is not itself a data item: it 602 is just a syntactic feature to close an indefinite-length item. 604 If the "break" stop code appears anywhere where a data item is 605 expected, other than directly inside an indefinite-length string, 606 array, or map -- for example directly inside a definite-length array 607 or map -- the enclosing item is not well-formed. 609 3.2.2. Indefinite-Length Arrays and Maps 611 Indefinite-length arrays and maps are represented using their major 612 type with the additional information value of 31, followed by an 613 arbitrary-length sequence of zero or more items for an array or key/ 614 value pairs for a map, followed by the "break" stop code 615 (Section 3.2.1). In other words, indefinite-length arrays and maps 616 look identical to other arrays and maps except for beginning with the 617 additional information value of 31 and ending with the "break" stop 618 code. 620 If the break stop code appears after a key in a map, in place of that 621 key's value, the map is not well-formed. 623 There is no restriction against nesting indefinite-length array or 624 map items. 
A "break" only terminates a single item, so nested 625 indefinite-length items need exactly as many "break" stop codes as 626 there are type bytes starting an indefinite-length item. 628 For example, assume an encoder wants to represent the abstract array 629 [1, [2, 3], [4, 5]]. The definite-length encoding would be 630 0x8301820203820405: 632 83 -- Array of length 3 633 01 -- 1 634 82 -- Array of length 2 635 02 -- 2 636 03 -- 3 637 82 -- Array of length 2 638 04 -- 4 639 05 -- 5 641 Indefinite-length encoding could be applied independently to each of 642 the three arrays encoded in this data item, as required, leading to 643 representations such as: 645 0x9f018202039f0405ffff 646 9F -- Start indefinite-length array 647 01 -- 1 648 82 -- Array of length 2 649 02 -- 2 650 03 -- 3 651 9F -- Start indefinite-length array 652 04 -- 4 653 05 -- 5 654 FF -- "break" (inner array) 655 FF -- "break" (outer array) 657 0x9f01820203820405ff 658 9F -- Start indefinite-length array 659 01 -- 1 660 82 -- Array of length 2 661 02 -- 2 662 03 -- 3 663 82 -- Array of length 2 664 04 -- 4 665 05 -- 5 666 FF -- "break" 668 0x83018202039f0405ff 669 83 -- Array of length 3 670 01 -- 1 671 82 -- Array of length 2 672 02 -- 2 673 03 -- 3 674 9F -- Start indefinite-length array 675 04 -- 4 676 05 -- 5 677 FF -- "break" 679 0x83019f0203ff820405 680 83 -- Array of length 3 681 01 -- 1 682 9F -- Start indefinite-length array 683 02 -- 2 684 03 -- 3 685 FF -- "break" 686 82 -- Array of length 2 687 04 -- 4 688 05 -- 5 690 An example of an indefinite-length map (that happens to have two key/ 691 value pairs) might be: 693 0xbf6346756ef563416d7421ff 694 BF -- Start indefinite-length map 695 63 -- First key, UTF-8 string length 3 696 46756e -- "Fun" 697 F5 -- First value, true 698 63 -- Second key, UTF-8 string length 3 699 416d74 -- "Amt" 700 21 -- Second value, -2 701 FF -- "break" 703 3.2.3. 
Indefinite-Length Byte Strings and Text Strings 705 Indefinite-length strings are represented by a byte containing the 706 major type and additional information value of 31, followed by a 707 series of zero or more byte or text strings ("chunks") that have 708 definite lengths, followed by the "break" stop code (Section 3.2.1). 709 The data item represented by the indefinite-length string is the 710 concatenation of the chunks (i.e., the empty byte or text string, 711 respectively, if no chunk is present). (Note that zero-length 712 chunks, while not particularly useful, are permitted.) 714 If any item between the indefinite-length string indicator 715 (0b010_11111 or 0b011_11111) and the "break" stop code is not a 716 definite-length string item of the same major type, the string is not 717 well-formed. 719 If any definite-length text string inside an indefinite-length text 720 string is invalid, the indefinite-length text string is invalid. 721 Note that this implies that the bytes of a single UTF-8 character 722 cannot be spread between chunks: a new chunk can only be started at a 723 character boundary. 725 For example, assume the sequence: 727 0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111 729 5F -- Start indefinite-length byte string 730 44 -- Byte string of length 4 731 aabbccdd -- Bytes content 732 43 -- Byte string of length 3 733 eeff99 -- Bytes content 734 FF -- "break" 736 After decoding, this results in a single byte string with seven 737 bytes: 0xaabbccddeeff99. 739 3.3. Floating-Point Numbers and Values with No Content 741 Major type 7 is for two types of data: floating-point numbers and 742 "simple values" that do not need any content. Each value of the 743 5-bit additional information in the initial byte has its own separate 744 meaning, as defined in Table 2. Like the major types for integers, 745 items of this major type do not carry content data; all the 746 information is in the initial bytes. 
748 +-------------+---------------------------------------------------+ 749 | 5-Bit Value | Semantics | 750 +=============+===================================================+ 751 | 0..23 | Simple value (value 0..23) | 752 +-------------+---------------------------------------------------+ 753 | 24 | Simple value (value 32..255 in following byte) | 754 +-------------+---------------------------------------------------+ 755 | 25 | IEEE 754 Half-Precision Float (16 bits follow) | 756 +-------------+---------------------------------------------------+ 757 | 26 | IEEE 754 Single-Precision Float (32 bits follow) | 758 +-------------+---------------------------------------------------+ 759 | 27 | IEEE 754 Double-Precision Float (64 bits follow) | 760 +-------------+---------------------------------------------------+ 761 | 28-30 | Reserved, not well-formed in the present document | 762 +-------------+---------------------------------------------------+ 763 | 31 | "break" stop code for indefinite-length items | 764 | | (Section 3.2.1) | 765 +-------------+---------------------------------------------------+ 767 Table 2: Values for Additional Information in Major Type 7 769 As with all other major types, the 5-bit value 24 signifies a single- 770 byte extension: it is followed by an additional byte to represent the 771 simple value. (To minimize confusion, only the values 32 to 255 are 772 used.) This maintains the structure of the initial bytes: as for the 773 other major types, the length of these always depends on the 774 additional information in the first byte. Table 3 lists the values 775 assigned and available for simple types. 
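As a non-normative illustration of Table 2, the following Python sketch classifies a major type 7 item from its initial bytes. The function name describe_major_type_7 and the strings it returns are assumptions made for this example only; they are not defined by this specification. The three floating-point cases rely on the "e", "f", and "d" format characters of Python's struct module.

```python
import struct


def describe_major_type_7(data: bytes) -> str:
    """Classify a major type 7 item from its initial bytes (illustrative only)."""
    if not data or data[0] >> 5 != 7:
        raise ValueError("not a major type 7 item")
    ai = data[0] & 0x1F  # 5-bit additional information
    if ai <= 23:
        return f"simple({ai})"
    if ai == 24:
        if len(data) < 2:
            raise ValueError("truncated item")
        if data[1] < 32:
            # Only the values 32 to 255 are used in the two-byte form.
            raise ValueError("not well-formed: simple value below 32 after 0xf8")
        return f"simple({data[1]})"
    if ai in (25, 26, 27):
        fmt, size = {25: (">e", 2), 26: (">f", 4), 27: (">d", 8)}[ai]
        if len(data) < 1 + size:
            raise ValueError("truncated item")
        (number,) = struct.unpack(fmt, data[1:1 + size])
        return f"float({number})"
    if ai == 31:
        return "break"  # stop code; not itself a data item
    raise ValueError("not well-formed: additional information 28..30 is reserved")
```

For instance, under these assumptions the single byte 0xf4 is reported as simple(20), and the three bytes 0xf93e00 as the half-precision float 1.5.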
777 +---------+-----------------+ 778 | Value | Semantics | 779 +=========+=================+ 780 | 0..19 | (Unassigned) | 781 +---------+-----------------+ 782 | 20 | False | 783 +---------+-----------------+ 784 | 21 | True | 785 +---------+-----------------+ 786 | 22 | Null | 787 +---------+-----------------+ 788 | 23 | Undefined value | 789 +---------+-----------------+ 790 | 24..31 | (Reserved) | 791 +---------+-----------------+ 792 | 32..255 | (Unassigned) | 793 +---------+-----------------+ 795 Table 3: Simple Values 797 An encoder MUST NOT issue two-byte sequences that start with 0xf8 798 (major type = 7, additional information = 24) and continue with a 799 byte less than 0x20 (32 decimal). Such sequences are not well- 800 formed. (This implies that an encoder cannot encode false, true, 801 null, or undefined in two-byte sequences; only the one-byte variants 802 of these are well-formed.) 804 The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit 805 IEEE 754 binary floating-point values [IEEE754]. These floating- 806 point values are encoded in the additional bytes of the appropriate 807 size. (See Appendix D for some information about 16-bit floating 808 point.) 810 3.4. Tagging of Items 812 In CBOR, a data item can be enclosed by a tag to give it additional 813 semantics while retaining its structure. The tag is major type 6, 814 and represents an unsigned integer as indicated by the tag's argument 815 (Section 3); the (sole) enclosed data item is carried as content 816 data. If a tag requires structured data, this structure is encoded 817 into the nested data item. The definition of a tag number usually 818 restricts what kinds of nested data item or items are valid for tags 819 using this tag number. 821 For example, assume that a byte string of length 12 is marked with a 822 tag of number 2 to indicate it is a positive bignum (Section 3.4.3).
823 This would be marked as 0b110_00010 (major type 6, additional 824 information 2 for the tag number) followed by 0b010_01100 (major type 825 2, additional information of 12 for the length) followed by the 12 826 bytes of the bignum. 828 Decoders do not need to understand tags of every tag number, and tags 829 may be of little value in applications where the implementation 830 creating a particular CBOR data item and the implementation decoding 831 that stream know the semantic meaning of each item in the data flow. 832 Their primary purpose in this specification is to define common data 833 types such as dates. A secondary purpose is to provide conversion 834 hints when it is foreseen that the CBOR data item needs to be 835 translated into a different format, requiring hints about the content 836 of items. Understanding the semantics of tags is optional for a 837 decoder; it can just jump over the initial bytes of the tag (that 838 encode the tag number) and interpret the tag content itself, 839 presenting both tag number and tag content to the application. 841 A tag applies semantics to the data item it encloses. Thus, if tag A 842 encloses tag B, which encloses data item C, tag A applies to the 843 result of applying tag B on data item C. That is, a tag is a data 844 item consisting of a tag number and an enclosed value. The content 845 of the tag (the enclosed data item) is the data item (the value) that 846 is being tagged. 848 IANA maintains a registry of tag numbers as described in Section 9.2. 849 Table 4 provides a list of tag numbers that were defined in 850 [RFC7049], with definitions in the rest of this section. Note that 851 many other tag numbers have been defined since the publication of 852 [RFC7049]; see the registry described at Section 9.2 for the complete 853 list. 
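The observation above, that a decoder can jump over the initial bytes of a tag and present both tag number and tag content to the application, can be sketched in Python as follows. This is a minimal, non-normative sketch: the function name peel_tags is hypothetical, the 12 payload bytes are arbitrary illustrative values, and the tag content is returned still encoded rather than decoded.

```python
from typing import List, Tuple


def peel_tags(data: bytes) -> Tuple[List[int], bytes]:
    """Strip leading tag heads (major type 6).

    Returns the tag numbers, outermost first, and the still-encoded
    tag content that follows them.
    """
    tags = []
    while data and data[0] >> 5 == 6:
        ai = data[0] & 0x1F  # additional information of the tag head
        if ai < 24:
            number, head_len = ai, 1
        elif ai <= 27:
            size = 1 << (ai - 24)  # 1, 2, 4, or 8 argument bytes
            if len(data) < 1 + size:
                raise ValueError("truncated tag head")
            number, head_len = int.from_bytes(data[1:1 + size], "big"), 1 + size
        else:
            raise ValueError("not well-formed: 28..31 in a tag head")
        tags.append(number)
        data = data[head_len:]
    return tags, data


# The bignum example from the text: tag number 2 enclosing a byte string
# of length 12 (the payload bytes themselves are arbitrary here).
payload = bytes(range(1, 13))
item = bytes([0b110_00010, 0b010_01100]) + payload
```

Calling peel_tags(item) yields the tag number list [2] together with the encoded byte string; when tag A encloses tag B, the list holds A first and then B, matching the nesting described above.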
855 +------------+-------------+----------------------------------+ 856 | Tag Number | Data Item | Semantics | 857 +============+=============+==================================+ 858 | 0 | text string | Standard date/time string; see | 859 | | | Section 3.4.1 | 860 +------------+-------------+----------------------------------+ 861 | 1 | multiple | Epoch-based date/time; see | 862 | | | Section 3.4.2 | 863 +------------+-------------+----------------------------------+ 864 | 2 | byte string | Positive bignum; see | 865 | | | Section 3.4.3 | 866 +------------+-------------+----------------------------------+ 867 | 3 | byte string | Negative bignum; see | 868 | | | Section 3.4.3 | 869 +------------+-------------+----------------------------------+ 870 | 4 | array | Decimal fraction; see | 871 | | | Section 3.4.4 | 872 +------------+-------------+----------------------------------+ 873 | 5 | array | Bigfloat; see Section 3.4.4 | 874 +------------+-------------+----------------------------------+ 875 | 21 | multiple | Expected conversion to base64url | 876 | | | encoding; see Section 3.4.5.2 | 877 +------------+-------------+----------------------------------+ 878 | 22 | multiple | Expected conversion to base64 | 879 | | | encoding; see Section 3.4.5.2 | 880 +------------+-------------+----------------------------------+ 881 | 23 | multiple | Expected conversion to base16 | 882 | | | encoding; see Section 3.4.5.2 | 883 +------------+-------------+----------------------------------+ 884 | 24 | byte string | Encoded CBOR data item; see | 885 | | | Section 3.4.5.1 | 886 +------------+-------------+----------------------------------+ 887 | 32 | text string | URI; see Section 3.4.5.3 | 888 +------------+-------------+----------------------------------+ 889 | 33 | text string | base64url; see Section 3.4.5.3 | 890 +------------+-------------+----------------------------------+ 891 | 34 | text string | base64; see Section 3.4.5.3 | 892 
+------------+-------------+----------------------------------+ 893 | 35 | text string | Regular expression; see | 894 | | | Section 3.4.5.3 | 895 +------------+-------------+----------------------------------+ 896 | 36 | text string | MIME message; see | 897 | | | Section 3.4.5.3 | 898 +------------+-------------+----------------------------------+ 899 | 55799 | multiple | Self-described CBOR; see | 900 | | | Section 3.4.6 | 901 +------------+-------------+----------------------------------+ 903 Table 4: Tag numbers defined in RFC 7049 905 Conceptually, tags are interpreted in the generic data model, not at 906 (de-)serialization time. A small number of tags (specifically, tag 907 number 25 and tag number 29) have been registered with semantics that 908 may require processing at (de-)serialization time: The decoder needs 909 to be aware and the encoder needs to be in control of the exact 910 sequence in which data items are encoded into the CBOR data stream. 911 This means these tags cannot be implemented on top of every generic 912 CBOR encoder/decoder (which might not reflect the serialization order 913 for entries in a map at the data model level and vice versa); their 914 implementation therefore typically needs to be integrated into the 915 generic encoder/decoder. The definition of new tags with this 916 property is NOT RECOMMENDED. 918 Protocols using tag numbers 0 and 1 extend the generic data model 919 (Section 2) with data items representing points in time; tag numbers 920 2 and 3, with arbitrarily sized integers; and tag numbers 4 and 5, 921 with floating point values of arbitrary size and precision. 923 3.4.1. Standard Date/Time String 925 Tag number 0 contains a text string in the standard format described 926 by the "date-time" production in [RFC3339], as refined by Section 3.3 927 of [RFC4287], representing the point in time described there. A 928 nested item of another type or that doesn't match the [RFC4287] 929 format is invalid. 931 3.4.2. 
Epoch-based Date/Time 933 Tag number 1 contains a numerical value counting the number of 934 seconds from 1970-01-01T00:00Z in UTC time to the represented point 935 in civil time. 937 The enclosed item MUST be an unsigned or negative integer (major 938 types 0 and 1), or a floating-point number (major type 7 with 939 additional information 25, 26, or 27). Other contained types are 940 invalid. 942 Non-negative values (major type 0 and non-negative floating-point 943 numbers) stand for time values on or after 1970-01-01T00:00Z UTC and 944 are interpreted according to POSIX [TIME_T]. (POSIX time is also 945 known as UNIX Epoch time. Note that leap seconds are handled 946 specially by POSIX time and this results in a 1 second discontinuity 947 several times per decade.) Note that applications that require the 948 expression of times beyond early 2106 cannot leave out support of 949 64-bit integers for the enclosed value. 951 Negative values (major type 1 and negative floating-point numbers) 952 are interpreted as determined by the application requirements as 953 there is no universal standard for UTC count-of-seconds time before 954 1970-01-01T00:00Z (this is particularly true for points in time that 955 precede discontinuities in national calendars). The same applies to 956 non-finite values. 958 To indicate fractional seconds, floating-point values can be used 959 within tag number 1 instead of integer values. Note that this 960 generally requires binary64 support, as binary16 and binary32 provide 961 non-zero fractions of seconds only for a short period of time around 962 early 1970. An application that requires tag number 1 support may 963 restrict the enclosed value to be an integer (or a floating-point 964 value) only. 966 3.4.3. Bignums 968 Protocols using tag numbers 2 and 3 extend the generic data model 969 (Section 2) with "bignums" representing arbitrarily sized integers. 
970 In the generic data model, bignum values are not equal to integers 971 from the basic data model, but specific data models can define that 972 equivalence, and preferred encoding never makes use of bignums that 973 also can be expressed as basic integers (see below). 975 Bignums are encoded as a byte string data item, which is interpreted 976 as an unsigned integer n in network byte order. Contained items of 977 other types are invalid. For tag number 2, the value of the bignum 978 is n. For tag number 3, the value of the bignum is -1 - n. The 979 preferred encoding of the byte string is to leave out any leading 980 zeroes (note that this means the preferred encoding for n = 0 is the 981 empty byte string, but see below). Decoders that understand these 982 tags MUST be able to decode bignums that do have leading zeroes. The 983 preferred encoding of an integer that can be represented using major 984 type 0 or 1 is to encode it this way instead of as a bignum (which 985 means that the empty string never occurs in a bignum when using 986 preferred encoding). Note that this means the non-preferred choice 987 of a bignum representation instead of a basic integer for encoding a 988 number is not intended to have application semantics (just as the 989 choice of a longer basic integer representation than needed, such as 990 0x1800 for 0x00 does not). 992 For example, the number 18446744073709551616 (2**64) is represented 993 as 0b110_00010 (major type 6, tag number 2), followed by 0b010_01001 994 (major type 2, length 9), followed by 0x010000000000000000 (one byte 995 0x01 and eight bytes 0x00). In hexadecimal: 997 C2 -- Tag 2 998 49 -- Byte string of length 9 999 010000000000000000 -- Bytes content 1001 3.4.4. Decimal Fractions and Bigfloats 1003 Protocols using tag number 4 extend the generic data model with data 1004 items representing arbitrary-length decimal fractions of the form 1005 m*(10**e). 
Protocols using tag number 5 extend the generic data 1006 model with data items representing arbitrary-length binary fractions 1007 of the form m*(2**e). As with bignums, values of different types are 1008 not equal in the generic data model. 1010 Decimal fractions combine an integer mantissa with a base-10 scaling 1011 factor. They are most useful if an application needs the exact 1012 representation of a decimal fraction such as 1.1 because there is no 1013 exact representation for many decimal fractions in binary floating 1014 point. 1016 Bigfloats combine an integer mantissa with a base-2 scaling factor. 1017 They are binary floating-point values that can exceed the range or 1018 the precision of the three IEEE 754 formats supported by CBOR 1019 (Section 3.3). Bigfloats may also be used by constrained 1020 applications that need some basic binary floating-point capability 1021 without the need for supporting IEEE 754. 1023 A decimal fraction or a bigfloat is represented as a tagged array 1024 that contains exactly two integer numbers: an exponent e and a 1025 mantissa m. Decimal fractions (tag number 4) use base-10 exponents; 1026 the value of a decimal fraction data item is m*(10**e). Bigfloats 1027 (tag number 5) use base-2 exponents; the value of a bigfloat data 1028 item is m*(2**e). The exponent e MUST be represented in an integer 1029 of major type 0 or 1, while the mantissa also can be a bignum 1030 (Section 3.4.3). Contained items with other structures are invalid. 
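The representation just described can be sketched in Python as follows. This is a non-normative sketch under stated assumptions: the names encode_int, encode_fraction, and fraction_value are hypothetical, and for brevity both the exponent and the mantissa are limited to major types 0 and 1 (a bignum mantissa, which this section permits, is not handled).

```python
from fractions import Fraction


def encode_int(value: int) -> bytes:
    """Encode an integer as major type 0 or 1, using the shortest argument."""
    major, argument = (0, value) if value >= 0 else (1, -1 - value)
    if argument >= 2**64:
        raise ValueError("would need a bignum (tag 2 or 3); not handled here")
    if argument < 24:
        return bytes([(major << 5) | argument])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if argument < 1 << (8 * size):
            return bytes([(major << 5) | ai]) + argument.to_bytes(size, "big")
    raise AssertionError("unreachable")


def encode_fraction(tag_number: int, exponent: int, mantissa: int) -> bytes:
    """Encode tag 4 (decimal fraction) or tag 5 (bigfloat) as [e, m]."""
    if tag_number not in (4, 5):
        raise ValueError("expected tag number 4 or 5")
    # 0xC0 | n is the one-byte tag head; 0x82 is an array of length 2.
    return bytes([0xC0 | tag_number, 0x82]) + encode_int(exponent) + encode_int(mantissa)


def fraction_value(tag_number: int, exponent: int, mantissa: int) -> Fraction:
    """Exact value m*(10**e) for tag 4 or m*(2**e) for tag 5."""
    base = 10 if tag_number == 4 else 2
    return mantissa * Fraction(base) ** exponent
```

With these helpers, encode_fraction(4, -2, 27315) and encode_fraction(5, -1, 3) produce the byte sequences of the two examples that follow, and fraction_value confirms that they stand for 273.15 and 1.5 exactly.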
1032 An example of a decimal fraction is that the number 273.15 could be 1033 represented as 0b110_00100 (major type of 6 for the tag, additional 1034 information of 4 for the number of tag), followed by 0b100_00010 1035 (major type of 4 for the array, additional information of 2 for the 1036 length of the array), followed by 0b001_00001 (major type of 1 for 1037 the first integer, additional information of 1 for the value of -2), 1038 followed by 0b000_11001 (major type of 0 for the second integer, 1039 additional information of 25 for a two-byte value), followed by 1040 0b0110101010110011 (27315 in two bytes). In hexadecimal: 1042 C4 -- Tag 4 1043 82 -- Array of length 2 1044 21 -- -2 1045 19 6ab3 -- 27315 1047 An example of a bigfloat is that the number 1.5 could be represented 1048 as 0b110_00101 (major type of 6 for the tag, additional information 1049 of 5 for the number of tag), followed by 0b100_00010 (major type of 4 1050 for the array, additional information of 2 for the length of the 1051 array), followed by 0b001_00000 (major type of 1 for the first 1052 integer, additional information of 0 for the value of -1), followed 1053 by 0b000_00011 (major type of 0 for the second integer, additional 1054 information of 3 for the value of 3). In hexadecimal: 1056 C5 -- Tag 5 1057 82 -- Array of length 2 1058 20 -- -1 1059 03 -- 3 1061 Decimal fractions and bigfloats provide no representation of 1062 Infinity, -Infinity, or NaN; if these are needed in place of a 1063 decimal fraction or bigfloat, the IEEE 754 half-precision 1064 representations from Section 3.3 can be used. For constrained 1065 applications, where there is a choice between representing a specific 1066 number as an integer and as a decimal fraction or bigfloat (such as 1067 when the exponent is small and non-negative), there is a quality-of- 1068 implementation expectation that the integer representation is used 1069 directly. 1071 3.4.5. 
Content Hints 1073 The tags in this section are for content hints that might be used by 1074 generic CBOR processors. These content hints do not extend the 1075 generic data model. 1077 3.4.5.1. Encoded CBOR Data Item 1079 Sometimes it is beneficial to carry an embedded CBOR data item that 1080 is not meant to be decoded immediately at the time the enclosing data 1081 item is being decoded. Tag number 24 (CBOR data item) can be used to 1082 tag the embedded byte string as a data item encoded in CBOR format. 1083 Contained items that aren't byte strings are invalid. A contained 1084 byte string is valid if it encodes a well-formed CBOR item; validity 1085 checking of the decoded CBOR item is not required for tag validity 1086 (but could be offered by a generic decoder as a special option). 1088 3.4.5.2. Expected Later Encoding for CBOR-to-JSON Converters 1090 Tag numbers 21 to 23 indicate that a byte string might require a 1091 specific encoding when interoperating with a text-based 1092 representation. These tags are useful when an encoder knows that the 1093 byte string data it is writing is likely to be later converted to a 1094 particular JSON-based usage. That usage specifies that some strings 1095 are encoded as base64, base64url, and so on. The encoder uses byte 1096 strings instead of doing the encoding itself to reduce the message 1097 size, to reduce the code size of the encoder, or both. The encoder 1098 does not know whether the converter will be generic, and 1099 therefore wants to say what it believes is the proper way to convert 1100 byte strings to JSON. 1102 The data item tagged can be a byte string or any other data item. In 1103 the latter case, the tag applies to all of the byte string data items 1104 contained in the data item, except for those contained in a nested 1105 data item tagged with an expected conversion. 1107 These three tag numbers suggest conversions to three of the base data 1108 encodings defined in [RFC4648].
For base64url encoding (tag number 1109 21), padding is not used (see Section 3.2 of RFC 4648); that is, all 1110 trailing equals signs ("=") are removed from the encoded string. For 1111 base64 encoding (tag number 22), padding is used as defined in RFC 1112 4648. For both base64url and base64, padding bits are set to zero 1113 (see Section 3.5 of RFC 4648), and encoding is performed without the 1114 inclusion of any line breaks, whitespace, or other additional 1115 characters. Note that, for all three tag numbers, the encoding of 1116 the empty byte string is the empty text string. 1118 3.4.5.3. Encoded Text 1120 Some text strings hold data that have formats widely used on the 1121 Internet, and sometimes those formats can be validated and presented 1122 to the application in appropriate form by the decoder. There are 1123 tags for some of these formats. As with tag numbers 21 to 23, if 1124 these tags are applied to an item other than a text string, they 1125 apply to all text string data items it contains. 1127 * Tag number 32 is for URIs, as defined in [RFC3986]. If the text 1128 string doesn't match the "URI-reference" production, the string is 1129 invalid. 1131 * Tag numbers 33 and 34 are for base64url- and base64-encoded text 1132 strings, as defined in [RFC4648]. If any of: 1134 - the encoded text string contains non-alphabet characters or 1135 only 1 character in the last block of 4, or 1137 - the padding bits in a 2- or 3-character block are not 0, or 1139 - the base64 encoding has the wrong number of padding characters, 1140 or 1142 - the base64url encoding has padding characters, 1144 the string is invalid. 1146 * Tag number 35 is for regular expressions that are roughly in Perl 1147 Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] or a 1148 version of the JavaScript regular expression syntax [ECMA262]. 
1149 (Note that more specific identification may be necessary if the 1150 actual version of the specification underlying the regular 1151 expression, or more than just the text of the regular expression 1152 itself, need to be conveyed.) Any contained string value is 1153 valid. 1155 * Tag number 36 is for MIME messages (including all headers), as 1156 defined in [RFC2045]. A text string that isn't a valid MIME 1157 message is invalid. (For this tag, validity checking may be 1158 particularly onerous for a generic decoder and might therefore not 1159 be offered. Note that many MIME messages are general binary data 1160 and can therefore not be represented in a text string; 1161 [IANA.cbor-tags] lists a registration for tag number 257 that is 1162 similar to tag number 36 but is used with an enclosed byte 1163 string.) 1165 Note that tag numbers 33 and 34 differ from 21 and 22 in that the 1166 data is transported in base-encoded form for the former and in raw 1167 byte string form for the latter. 1169 3.4.6. Self-Described CBOR 1171 In many applications, it will be clear from the context that CBOR is 1172 being employed for encoding a data item. For instance, a specific 1173 protocol might specify the use of CBOR, or a media type is indicated 1174 that specifies its use. However, there may be applications where 1175 such context information is not available, such as when CBOR data is 1176 stored in a file that does not have disambiguating metadata. Here, 1177 it may help to have some distinguishing characteristics for the data 1178 itself. 1180 Tag number 55799 is defined for this purpose. It does not impart any 1181 special semantics on the data item that it encloses; that is, the 1182 semantics of a data item enclosed in tag number 55799 is exactly 1183 identical to the semantics of the data item itself. 1185 The serialization of this tag's head is 0xd9d9f7, which does not 1186 appear to be in use as a distinguishing mark for any frequently used 1187 file types. 
In particular, 0xd9d9f7 is not a valid start of a 1188 Unicode text in any Unicode encoding if it is followed by a valid 1189 CBOR data item. 1191 For instance, a decoder might be able to decode both CBOR and JSON. 1192 Such a decoder would need to mechanically distinguish the two 1193 formats. An easy way for an encoder to help the decoder would be to 1194 tag the entire CBOR item with tag number 55799, the serialization of 1195 which will never be found at the beginning of a JSON text. 1197 4. Serialization Considerations 1199 4.1. Preferred Serialization 1201 For some values at the data model level, CBOR provides multiple 1202 serializations. For many applications, it is desirable that an 1203 encoder always chooses a preferred serialization (preferred 1204 encoding); however, the present specification does not put the burden 1205 of enforcing this preference on either encoder or decoder. 1207 Some constrained decoders may be limited in their ability to decode 1208 non-preferred serializations: For example, if only integers below 1209 1_000_000_000 are expected in an application, the decoder may leave 1210 out the code that would be needed to decode 64-bit arguments in 1211 integers. An encoder that always uses preferred serialization 1212 ("preferred encoder") interoperates with this decoder for the numbers 1213 that can occur in this application. More generally speaking, it 1214 therefore can be said that a preferred encoder is more universally 1215 interoperable (and also less wasteful) than one that, say, always 1216 uses 64-bit integers. 
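To make the notion of a preferred encoder concrete for integer arguments, the following non-normative Python sketch checks whether the head at the start of an encoded item uses the shortest possible argument. The function name head_is_preferred is hypothetical, and major type 7 and indefinite lengths are deliberately left out of scope. The thresholds correspond to the argument ranges listed in Section 4.2.1.

```python
def head_is_preferred(data: bytes) -> bool:
    """Return True if the head at the start of `data` uses the shortest argument.

    Applies to major types 0 through 6 with a definite argument only.
    """
    if not data:
        raise ValueError("empty input")
    major, ai = data[0] >> 5, data[0] & 0x1F
    if major == 7:
        raise ValueError("major type 7 is out of scope for this sketch")
    if ai < 24:
        return True  # argument carried in the initial byte itself
    if ai > 27:
        raise ValueError("indefinite length or reserved additional information")
    size = 1 << (ai - 24)  # 1, 2, 4, or 8 argument bytes
    if len(data) < 1 + size:
        raise ValueError("truncated head")
    argument = int.from_bytes(data[1:1 + size], "big")
    # Smallest argument that actually needs this many bytes.
    minimum = {1: 24, 2: 2**8, 4: 2**16, 8: 2**32}[size]
    return argument >= minimum
```

A constrained decoder could use such a check to reject input it is not prepared to handle; for example, 0x1864 (the integer 100) passes, whereas 0x1a000003e8 (the integer 1000 in the 32-bit variant) does not.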
1218 Similarly, a constrained encoder may be limited in the variety of 1219 representation variants it supports in such a way that it does not 1220 emit preferred serializations ("variant encoder"): for example, it could be 1221 designed to always use the 32-bit variant for an integer that it 1222 encodes even if a shorter representation is available (again, assuming 1223 that there is no application need for integers that can only be 1224 represented with the 64-bit variant). A decoder that does not rely 1225 on only ever receiving preferred serializations ("variation-tolerant 1226 decoder") can therefore be said to be more universally interoperable (it 1227 might very well optimize for the case of receiving preferred 1228 serializations, though). Full implementations of CBOR decoders are 1229 by definition variation-tolerant; the distinction is only relevant if 1230 a constrained implementation of a CBOR decoder meets a variant 1231 encoder. 1233 The preferred serialization always uses the shortest form of 1234 representing the argument (Section 3); it also uses the shortest 1235 floating-point encoding that preserves the value being encoded (see 1236 Section 5.5). Definite-length encoding is preferred whenever the 1237 length is known at the time the serialization of the item starts. 1239 4.2. Deterministically Encoded CBOR 1241 Some protocols may want encoders to only emit CBOR in a particular 1242 deterministic format; those protocols might also have the decoders 1243 check that their input is in that deterministic format. Those 1244 protocols are free to define what they mean by a "deterministic 1245 format" and what encoders and decoders are expected to do. This 1246 section defines a set of restrictions that can serve as the base of 1247 such a deterministic format. 1249 4.2.1.
Core Deterministic Encoding Requirements 1251 A CBOR encoding satisfies the "core deterministic encoding 1252 requirements" if it satisfies the following restrictions: 1254 * Preferred serialization MUST be used. In particular, this means 1255 that arguments (see Section 3) for integers, lengths in major 1256 types 2 through 5, and tags MUST be as short as possible, for 1257 instance: 1259 - 0 to 23 and -1 to -24 MUST be expressed in the same byte as the 1260 major type; 1262 - 24 to 255 and -25 to -256 MUST be expressed only with an 1263 additional uint8_t; 1265 - 256 to 65535 and -257 to -65536 MUST be expressed only with an 1266 additional uint16_t; 1268 - 65536 to 4294967295 and -65537 to -4294967296 MUST be expressed 1269 only with an additional uint32_t. 1271 Floating point values also MUST use the shortest form that 1272 preserves the value, e.g. 1.5 is encoded as 0xf93e00 and 1000000.5 1273 as 0xfa49742408. 1275 * Indefinite-length items MUST NOT appear. They can be encoded as 1276 definite-length items instead. 1278 * The keys in every map MUST be sorted in the bytewise lexicographic 1279 order of their deterministic encodings. For example, the 1280 following keys are sorted correctly: 1282 1. 10, encoded as 0x0a. 1284 2. 100, encoded as 0x1864. 1286 3. -1, encoded as 0x20. 1288 4. "z", encoded as 0x617a. 1290 5. "aa", encoded as 0x626161. 1292 6. [100], encoded as 0x811864. 1294 7. [-1], encoded as 0x8120. 1296 8. false, encoded as 0xf4. 1298 4.2.2. Additional Deterministic Encoding Considerations 1300 If a protocol allows for IEEE floats, then additional deterministic 1301 encoding rules might need to be added. One example rule might be to 1302 have all floats start as a 64-bit float, then do a test conversion to 1303 a 32-bit float; if the result is the same numeric value, use the 1304 shorter value and repeat the process with a test conversion to a 1305 16-bit float. (This rule selects 16-bit float for positive and 1306 negative Infinity as well.) 
Although IEEE floats can represent both 1307 positive and negative zero as distinct values, the application might 1308 not distinguish these and might decide to represent all zero values 1309 with a positive sign, disallowing negative zero. 1311 CBOR tags present additional considerations for deterministic 1312 encoding. If a CBOR-based protocol were to provide the same 1313 semantics for the presence and absence of a specific tag (e.g., by 1314 allowing both tag 1 data items and raw numbers in a date/time 1315 position, treating the latter as if they were tagged), the 1316 deterministic format would not allow them. In a protocol that 1317 requires tags in certain places to obtain specific semantics, the tag 1318 needs to appear in the deterministic format as well. Deterministic 1319 encoding considerations also apply to the content of tags. 1321 Protocols that include floating-point, big integer, or other complex values 1322 need to define extra requirements on their deterministic encodings. 1323 For example: 1325 * If a protocol includes a field that can express floating-point 1326 values (Section 3.3), the protocol's deterministic encoding needs 1327 to specify whether the integer 1.0 is encoded as 0x01, 0xf93c00, 1328 0xfa3f800000, or 0xfb3ff0000000000000. Three sensible rules for 1329 this are: 1331 1. Encode integral values that fit in 64 bits as values from 1332 major types 0 and 1, and other values as the smallest of 16-, 1333 32-, or 64-bit floating point that accurately represents the 1334 value, 1336 2. Encode all values as the smallest of 16-, 32-, or 64-bit 1337 floating point that accurately represents the value, even for 1338 integral values, or 1340 3. Encode all values as 64-bit floating point. 1342 Rule 1 straddles the boundaries between integers and floating 1343 point values, and Rule 3 does not use preferred encoding, so Rule 1344 2 may be a good choice in many cases.
1346 If NaN is an allowed value and there is no intent to support NaN 1347 payloads or signaling NaNs, the protocol needs to pick a single 1348 representation, for example 0xf97e00. If that simple choice is 1349 not possible, specific attention will be needed for NaN handling. 1351 Subnormal numbers (nonzero numbers with the lowest possible 1352 exponent of a given IEEE 754 number format) may be flushed to zero 1353 outputs or be treated as zero inputs in some floating point 1354 implementations. A protocol's deterministic encoding may want to 1355 exclude them from interchange, interchanging zero instead. 1357 * If a protocol includes a field that can express integers with an 1358 absolute value of 2^64 or larger using tag numbers 2 or 3 1359 (Section 3.4.3), the protocol's deterministic encoding needs to 1360 specify whether small integers are expressed using the tag or 1361 major types 0 and 1. 1363 * A protocol might give encoders the choice of representing a URL as 1364 either a text string or, using Section 3.4.5.3, tag number 32 1365 containing a text string. This protocol's deterministic encoding 1366 needs to either require that the tag is present or require that 1367 it's absent, not allow either one. 1369 4.2.3. Length-first map key ordering 1371 The core deterministic encoding requirements sort map keys in a 1372 different order from the one suggested by Section 3.9 of [RFC7049] 1373 (called "Canonical CBOR" there). Protocols that need to be 1374 compatible with [RFC7049]'s order can instead be specified in terms 1375 of this specification's "length-first core deterministic encoding 1376 requirements": 1378 A CBOR encoding satisfies the "length-first core deterministic 1379 encoding requirements" if it satisfies the core deterministic 1380 encoding requirements except that the keys in every map MUST be 1381 sorted such that: 1383 1. If two keys have different lengths, the shorter one sorts 1384 earlier; 1386 2. 
If two keys have the same length, the one with the lower value in 1387 (byte-wise) lexical order sorts earlier. 1389 For example, under the length-first core deterministic encoding 1390 requirements, the following keys are sorted correctly: 1392 1. 10, encoded as 0x0a. 1394 2. -1, encoded as 0x20. 1396 3. false, encoded as 0xf4. 1398 4. 100, encoded as 0x1864. 1400 5. "z", encoded as 0x617a. 1402 6. [-1], encoded as 0x8120. 1404 7. "aa", encoded as 0x626161. 1406 8. [100], encoded as 0x811864. 1408 (Although [RFC7049] used the term "Canonical CBOR" for its form of 1409 requirements on deterministic encoding, this document avoids this 1410 term because "canonicalization" is often associated with specific 1411 uses of deterministic encoding only. The terms are essentially 1412 interchangeable, however, and the set of core requirements in this 1413 document could also be called "Canonical CBOR", while the length- 1414 first-ordered version of that could be called "Old Canonical CBOR".) 1416 5. Creating CBOR-Based Protocols 1418 Data formats such as CBOR are often used in environments where there 1419 is no format negotiation. A specific design goal of CBOR is to not 1420 need any included or assumed schema: a decoder can take a CBOR item 1421 and decode it with no other knowledge. 1423 Of course, in real-world implementations, the encoder and the decoder 1424 will have a shared view of what should be in a CBOR data item. For 1425 example, an agreed-to format might be "the item is an array whose 1426 first value is a UTF-8 string, second value is an integer, and 1427 subsequent values are zero or more floating-point numbers" or "the 1428 item is a map that has byte strings for keys and contains at least 1429 one pair whose key is 0xab01". 1431 CBOR-based protocols MUST specify how their decoders handle invalid 1432 and other unexpected data. CBOR-based protocols MAY specify that 1433 they treat arbitrary valid data as unexpected. 
Encoders for CBOR- 1434 based protocols MUST produce only valid items, that is, the protocol 1435 cannot be designed to make use of invalid items. An encoder can be 1436 capable of encoding as many or as few types of values as is required 1437 by the protocol in which it is used; a decoder can be capable of 1438 understanding as many or as few types of values as is required by the 1439 protocols in which it is used. This lack of restrictions allows CBOR 1440 to be used in extremely constrained environments. 1442 This section discusses some considerations in creating CBOR-based 1443 protocols. With few exceptions, it is advisory only and explicitly 1444 excludes any language from BCP 14 other than words that could be 1445 interpreted as "MAY" in the sense of BCP 14. The exceptions aim at 1446 facilitating interoperability of CBOR-based protocols while making 1447 use of a wide variety of both generic and application-specific 1448 encoders and decoders. 1450 5.1. CBOR in Streaming Applications 1452 In a streaming application, a data stream may be composed of a 1453 sequence of CBOR data items concatenated back-to-back. In such an 1454 environment, the decoder immediately begins decoding a new data item 1455 if data is found after the end of a previous data item. 1457 Not all of the bytes making up a data item may be immediately 1458 available to the decoder; some decoders will buffer additional data 1459 until a complete data item can be presented to the application. 1460 Other decoders can present partial information about a top-level data 1461 item to an application, such as the nested data items that could 1462 already be decoded, or even parts of a byte string that hasn't 1463 completely arrived yet. 1465 Note that some applications and protocols will not want to use 1466 indefinite-length encoding. 
Using indefinite-length encoding allows 1467 an encoder to not need to marshal all the data for counting, but it 1468 requires a decoder to allocate increasing amounts of memory while 1469 waiting for the end of the item. This might be fine for some 1470 applications but not others. 1472 5.2. Generic Encoders and Decoders 1474 A generic CBOR decoder can decode all well-formed CBOR data and 1475 present them to an application. See Appendix C. 1477 Even though CBOR attempts to minimize these cases, not all well- 1478 formed CBOR data is valid: for example, the encoded text string 1479 "0x62c0ae" does not contain valid UTF-8 and so is not a valid CBOR 1480 item. Also, specific tags may make semantic constraints that may be 1481 violated, such as a bignum tag enclosing another tag, or an instance 1482 of tag number 0 containing a byte string, or containing a text string 1483 with contents that do not match [RFC3339]'s "date-time" production. 1484 There is no requirement that generic encoders and decoders make 1485 unnatural choices for their application interface to enable the 1486 processing of invalid data. Generic encoders and decoders are 1487 expected to forward simple values and tags even if their specific 1488 codepoints are not registered at the time the encoder/decoder is 1489 written (Section 5.4). 1491 Generic decoders provide ways to present well-formed CBOR values, 1492 both valid and invalid, to an application. The diagnostic notation 1493 (Section 8) may be used to present well-formed CBOR values to humans. 1495 Generic encoders provide an application interface that allows the 1496 application to specify any well-formed value, including simple values 1497 and tags unknown to the encoder. 1499 5.3. Validity of Items 1501 A well-formed but invalid CBOR data item presents a problem with 1502 interpreting the data encoded in it in the CBOR data model. 
A CBOR- 1503 based protocol could be specified in several layers, in which the 1504 lower layers don't process the semantics of some of the CBOR data 1505 they forward. These layers can't notice any validity errors in data 1506 they don't process and MUST forward that data as-is. The first layer 1507 that does process the semantics of an invalid CBOR item MUST take one 1508 of two choices: 1510 1. Replace the problematic item with an error marker and continue 1511 with the next item, or 1513 2. Issue an error and stop processing altogether. 1515 A CBOR-based protocol MUST specify which of these options its 1516 decoders take, for each kind of invalid item they might encounter. 1518 Such problems might occur at the basic validity level of CBOR or in 1519 the context of tags (tag validity). 1521 5.3.1. Basic validity 1523 Two kinds of validity errors can occur in the basic generic data 1524 model: 1526 Duplicate keys in a map: Generic decoders (Section 5.2) make data 1527 available to applications using the native CBOR data model. That 1528 data model includes maps (key-value mappings with unique keys), 1529 not multimaps (key-value mappings where multiple entries can have 1530 the same key). Thus, a generic decoder that gets a CBOR map item 1531 that has duplicate keys will decode to a map with only one 1532 instance of that key, or it might stop processing altogether. On 1533 the other hand, a "streaming decoder" may not even be able to 1534 notice (Section 5.6). 1536 Invalid UTF-8 string: A decoder might or might not want to verify 1537 that the sequence of bytes in a UTF-8 string (major type 3) is 1538 actually valid UTF-8 and react appropriately. 1540 5.3.2. 
Tag validity 1542 Two additional kinds of validity errors are introduced by adding tags 1543 to the basic generic data model: 1545 Inadmissible type for tag content: Tags (Section 3.4) specify what 1546 type of data item is supposed to be enclosed by the tag; for 1547 example, the tags for positive or negative bignums are supposed to 1548 be put on byte strings. A decoder that decodes the tagged data 1549 item into a native representation (a native big integer in this 1550 example) is expected to check the type of the data item being 1551 tagged. Even decoders that don't have such native representations 1552 available in their environment may perform the check on those tags 1553 known to them and react appropriately. 1555 Inadmissible value for tag content: The type of data item may be 1556 admissible for a tag's content, but the specific value may not be; 1557 e.g., a value of "yesterday" is not acceptable for the content of 1558 tag 0, even though it properly is a text string. A decoder that 1559 normally ingests such tags into equivalent platform types might 1560 present this tag to the application in a similar way to how it 1561 would present a tag with an unknown tag number (Section 5.4). 1563 5.4. Validity and Evolution 1565 A decoder with validity checking will expend the effort to reliably 1566 detect data items with validity errors. For example, such a decoder 1567 needs to have an API that reports an error (and does not return data) 1568 for a CBOR data item that contains any of the validity errors listed 1569 in the previous subsection. 1571 The set of tags defined in the tag registry (Section 9.2), as well as 1572 the set of simple values defined in the simple values registry 1573 (Section 9.1), can grow at any time beyond the set understood by a 1574 generic decoder. A validity-checking decoder can do one of two 1575 things when it encounters such a case that it does not recognize: 1577 * It can report an error (and not return data). 
Note that this 1578 error is not a validity error per se. This kind of error is more 1579 likely to be raised by a decoder that would be performing validity 1580 checking if this were a known case. 1582 * It can emit the unknown item (type, value, and, for tags, the 1583 decoded tagged data item) to the application calling the decoder, 1584 with an indication that the decoder did not recognize that tag 1585 number or simple value. 1587 The latter approach, which is also appropriate for decoders that do 1588 not support validity checking, provides forward compatibility with 1589 newly registered tags and simple values without the requirement to 1590 update the encoder at the same time as the calling application. (For 1591 this, the API for the decoder needs to have a way to mark unknown 1592 items so that the calling application can handle them in a manner 1593 appropriate for the program.) 1595 Since some of the processing needed for validity checking may have an 1596 appreciable cost (in particular with duplicate detection for maps), 1597 support of validity checking is not a requirement placed on all CBOR 1598 decoders. 1600 Some encoders will rely on their applications to provide input data 1601 in such a way that valid CBOR results from the encoder. A generic 1602 encoder also may want to provide a validity-checking mode where it 1603 reliably limits its output to valid CBOR, independent of whether or 1604 not its application is indeed providing API-conformant data. 1606 5.5. Numbers 1608 CBOR-based protocols should take into account that different language 1609 environments pose different restrictions on the range and precision 1610 of numbers that are representable. For example, the JavaScript 1611 number system treats all numbers as floating point, which may result 1612 in silent loss of precision in decoding integers with more than 53 1613 significant bits. 
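The 53-bit limit can be illustrated with a short Python sketch. Python integers are arbitrary precision, so the loss only appears once a value is converted to a binary64 float, which is what a JavaScript-style number system does for every number. The helper name fits_binary64_exactly is hypothetical.

```python
def fits_binary64_exactly(n: int) -> bool:
    """Return True if integer n survives a round trip through a binary64 float."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # n is too large to be converted to a binary64 float at all.
        return False


print(fits_binary64_exactly(2**53))      # True
print(fits_binary64_exactly(2**53 + 1))  # False: silently rounds to 2**53
```

A receiving application in such an environment could use a check of this kind to reject, rather than silently round, integers it cannot represent.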
A protocol that uses numbers should define its 1614 expectations on the handling of non-trivial numbers in decoders and 1615 receiving applications. 1617 A CBOR-based protocol that includes floating-point numbers can 1618 restrict which of the three formats (half-precision, single- 1619 precision, and double-precision) are to be supported. For an 1620 integer-only application, a protocol may want to completely exclude 1621 the use of floating-point values. 1623 A CBOR-based protocol designed for compactness may want to exclude 1624 specific integer encodings that are longer than necessary for the 1625 application, such as to save the need to implement 64-bit integers. 1626 There is an expectation that encoders will use the most compact 1627 integer representation that can represent a given value. However, a 1628 compact application should accept values that use a longer-than- 1629 needed encoding (such as encoding "0" as 0b000_11001 followed by two 1630 bytes of 0x00) as long as the application can decode an integer of 1631 the given size. 1633 The preferred encoding for a floating-point value is the shortest 1634 floating-point encoding that preserves its value, e.g., 0xf94580 for 1635 the number 5.5, and 0xfa45ad9c00 for the number 5555.5, unless the 1636 CBOR-based protocol specifically excludes the use of the shorter 1637 floating-point encodings. For NaN values, a shorter encoding is 1638 preferred if zero-padding the shorter significand towards the right 1639 reconstitutes the original NaN value (for many applications, the 1640 single NaN encoding 0xf97e00 will suffice). 1642 5.6. Specifying Keys for Maps 1644 The encoding and decoding applications need to agree on what types of 1645 keys are going to be used in maps. 
In applications that need to 1646 interwork with JSON-based applications, keys probably should be 1647 limited to UTF-8 strings only; otherwise, there has to be a specified 1648 mapping from the other CBOR types to Unicode characters, and this 1649 often leads to implementation errors. In applications where keys are 1650 numeric in nature and numeric ordering of keys is important to the 1651 application, directly using the numbers for the keys is useful. 1653 If multiple types of keys are to be used, consideration should be 1654 given to how these types would be represented in the specific 1655 programming environments that are to be used. For example, in 1656 JavaScript Maps [ECMA262], a key of integer 1 cannot be distinguished 1657 from a key of floating-point 1.0. This means that, if integer keys 1658 are used, the protocol needs to avoid use of floating-point keys the 1659 values of which happen to be integer numbers in the same map. 1661 Decoders that deliver data items nested within a CBOR data item 1662 immediately on decoding them ("streaming decoders") often do not keep 1663 the state that is necessary to ascertain uniqueness of a key in a 1664 map. Similarly, an encoder that can start encoding data items before 1665 the enclosing data item is completely available ("streaming encoder") 1666 may want to reduce its overhead significantly by relying on its data 1667 source to maintain uniqueness. 1669 A CBOR-based protocol MUST define what to do when a receiving 1670 application does see multiple identical keys in a map. The resulting 1671 rule in the protocol MUST respect the CBOR data model: it cannot 1672 prescribe a specific handling of the entries with the identical keys, 1673 except that it might have a rule that having identical keys in a map 1674 indicates a malformed map and that the decoder has to stop with an 1675 error. Duplicate keys are also prohibited by CBOR decoders that 1676 enforce validity (Section 5.4). 
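One way to apply the "stop with an error" rule is sketched below in Python. It assumes a hypothetical decoder that hands the application the key/value pairs of a map as a list, and a protocol that limits keys to text strings; the names DuplicateKeyError and build_map are illustrative only.

```python
class DuplicateKeyError(ValueError):
    """Raised when a decoded map contains the same key more than once."""


def build_map(pairs):
    """Build a dict from decoded (key, value) pairs, rejecting duplicate keys.

    Assumption: keys are text strings, so Python equality matches the
    key equivalence the protocol intends.
    """
    result = {}
    for key, value in pairs:
        if not isinstance(key, str):
            raise TypeError(f"unsupported key type: {type(key).__name__}")
        if key in result:
            # Do not pick a winner among the entries; report the map as bad.
            raise DuplicateKeyError(f"duplicate map key: {key!r}")
        result[key] = value
    return result
```

The sketch deliberately does not keep the first or the last entry for a repeated key, since the protocol rule cannot prescribe a specific handling of such entries other than treating the map as an error.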
1678 The CBOR data model for maps does not allow ascribing semantics to 1679 the order of the key/value pairs in the map representation. Thus, a 1680 CBOR-based protocol MUST NOT specify that changing the key/value pair 1681 order in a map would change the semantics, except to specify that 1682 some orders are disallowed, for example where they would not meet 1683 the requirements of a deterministic encoding (Section 4.2). (Any 1684 secondary effects of map ordering such as on timing, cache usage, and 1685 other potential side channels are not considered part of the 1686 semantics but may be enough reason on their own for a protocol to 1687 require a deterministic encoding format.) 1689 Applications for constrained devices that have maps where a small 1690 number of frequently used keys can be identified should consider 1691 using small integers as keys; for instance, a set of 24 or fewer 1692 frequent keys can each be encoded in a single byte as unsigned integers, 1693 up to 48 if negative integers are also used. Less frequently 1694 occurring keys can then use integers with longer encodings. 1696 5.6.1. Equivalence of Keys 1698 The specific data model applying to a CBOR data item is used to 1699 determine whether keys occurring in maps are duplicates or distinct. 1701 At the generic data model level, numerically equivalent integer and 1702 floating-point values are distinct from each other, as they are from 1703 the various big numbers (Tags 2 to 5). Similarly, text strings are 1704 distinct from byte strings, even if composed of the same bytes. A 1705 tagged value is distinct from an untagged value or from a value 1706 tagged with a different tag number.
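Some programming environments do not keep these values apart on their own. In Python, for instance, 1 == 1.0 == True, so a plain dict merges keys that the generic data model treats as distinct. The following sketch shows one possible workaround, wrapping each key with a label for its kind; the function name generic_key and the labels are hypothetical, and tags, arrays, maps, and NaN handling are left out for brevity.

```python
def generic_key(item):
    """Return a hashable key that keeps generic-data-model kinds distinct."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(item, bool):
        return ("simple", item)
    if isinstance(item, int):
        return ("int", item)
    if isinstance(item, float):
        return ("float", item)
    if isinstance(item, bytes):
        return ("bytes", item)
    if isinstance(item, str):
        return ("text", item)
    raise TypeError(f"unsupported key type: {type(item).__name__}")


plain = {1: "integer", 1.0: "float"}
wrapped = {generic_key(1): "integer", generic_key(1.0): "float"}
print(len(plain), len(wrapped))  # 1 2
```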
1708 Within each of these groups, numeric values are distinct unless they 1709 are numerically equal (specifically, -0.0 is equal to 0.0); for the 1710 purpose of map key equivalence, NaN (not a number) values are 1711 equivalent if they have the same significand after zero-extending 1712 both significands at the right to 64 bits. 1714 (Byte and text) strings are compared byte by byte, arrays element by 1715 element, and are equal if they have the same number of bytes/elements 1716 and the same values at the same positions. Two maps are equal if 1717 they have the same set of pairs regardless of their order; pairs are 1718 equal if both the key and value are equal. 1720 Tagged values are equal if both the tag number and the enclosed item 1721 are equal. (Note that a generic decoder that provides processing for 1722 a specific tag may not be able to distinguish some semantically 1723 equivalent values, e.g., if leading zeroes occur in the content of tag 1724 2/3 (Section 3.4.3).) Simple values are equal if they simply have 1725 the same value. Nothing else is equal in the generic data model; a 1726 simple value 2 is not equivalent to an integer 2, and an array is 1727 never equivalent to a map. 1729 As discussed in Section 2.2, specific data models can make values 1730 equivalent for the purpose of comparing map keys that are distinct in 1731 the generic data model. Note that this implies that a generic 1732 decoder may deliver a decoded map to an application that needs to be 1733 checked for duplicate map keys by that application (alternatively, 1734 the decoder may provide a programming interface to perform this 1735 service for the application). Specific data models cannot 1736 distinguish values for map keys that are equal for this purpose at 1737 the generic data model level. 1739 5.7.
Undefined Values 1741 In some CBOR-based protocols, the simple value (Section 3.3) of 1742 Undefined might be used by an encoder as a substitute for a data item 1743 with an encoding problem, in order to allow the rest of the enclosing 1744 data items to be encoded without harm. 1746 6. Converting Data between CBOR and JSON 1748 This section gives non-normative advice about converting between CBOR 1749 and JSON. Implementations of converters are free to use whichever 1750 advice here they want. 1752 It is worth noting that a JSON text is a sequence of characters, not 1753 an encoded sequence of bytes, while a CBOR data item consists of 1754 bytes, not characters. 1756 6.1. Converting from CBOR to JSON 1758 Most of the types in CBOR have direct analogs in JSON. However, some 1759 do not, and someone implementing a CBOR-to-JSON converter has to 1760 consider what to do in those cases. The following non-normative 1761 advice deals with these by converting them to a single substitute 1762 value, such as a JSON null. 1764 * An integer (major type 0 or 1) becomes a JSON number. 1766 * A byte string (major type 2) that is not embedded in a tag that 1767 specifies a proposed encoding is encoded in base64url without 1768 padding and becomes a JSON string. 1770 * A UTF-8 string (major type 3) becomes a JSON string. Note that 1771 JSON requires escaping certain characters ([RFC8259], Section 7): 1772 quotation mark (U+0022), reverse solidus (U+005C), and the "C0 1773 control characters" (U+0000 through U+001F). All other characters 1774 are copied unchanged into the JSON UTF-8 string. 1776 * An array (major type 4) becomes a JSON array. 1778 * A map (major type 5) becomes a JSON object. This is possible 1779 directly only if all keys are UTF-8 strings. A converter might 1780 also convert other keys into UTF-8 strings (such as by converting 1781 integers into strings containing their decimal representation); 1782 however, doing so introduces a danger of key collision. 
Note also 1783 that, if tags on UTF-8 strings are ignored as proposed below, this 1784 will cause a key collision if the tags are different but the 1785 strings are the same. 1787 * False (major type 7, additional information 20) becomes a JSON 1788 false. 1790 * True (major type 7, additional information 21) becomes a JSON 1791 true. 1793 * Null (major type 7, additional information 22) becomes a JSON 1794 null. 1796 * A floating-point value (major type 7, additional information 25 1797 through 27) becomes a JSON number if it is finite (that is, it can 1798 be represented in a JSON number); if the value is non-finite (NaN, 1799 or positive or negative Infinity), it is represented by the 1800 substitute value. 1802 * Any other simple value (major type 7, any additional information 1803 value not yet discussed) is represented by the substitute value. 1805 * A bignum (major type 6, tag number 2 or 3) is represented by 1806 encoding its byte string in base64url without padding and becomes 1807 a JSON string. For tag number 3 (negative bignum), a "~" (ASCII 1808 tilde) is inserted before the base-encoded value. (The conversion 1809 to a binary blob instead of a number is to prevent a likely 1810 numeric overflow for the JSON decoder.) 1812 * A byte string with an encoding hint (major type 6, tag number 21 1813 through 23) is encoded as described and becomes a JSON string. 1815 * For all other tags (major type 6, any other tag number), the 1816 enclosed CBOR item is represented as a JSON value; the tag number 1817 is ignored. 1819 * Indefinite-length items are made definite before conversion. 1821 6.2. Converting from JSON to CBOR 1823 All JSON values, once decoded, directly map into one or more CBOR 1824 values. As with any kind of CBOR generation, decisions have to be 1825 made with respect to number representation. 
In a suggested 1826 conversion: 1828 * JSON numbers without fractional parts (integer numbers) are 1829 represented as integers (major types 0 and 1, possibly major type 1830 6 tag number 2 and 3), choosing the shortest form; integers longer 1831 than an implementation-defined threshold may instead be 1832 represented as floating-point values. The default range that is 1833 represented as integer is -2**53+1..2**53-1 (fully exploiting the 1834 range for exact integers in the binary64 representation often used 1835 for decoding JSON [RFC7493]). A CBOR-based protocol, or a generic 1836 converter implementation, may choose -2**32..2**32-1 or 1837 -2**64..2**64-1 (fully using the integer ranges available in CBOR 1838 with uint32_t or uint64_t, respectively) or even -2**31..2**31-1 1839 or -2**63..2**63-1 (using popular ranges for two's complement 1840 signed integers). (If the JSON was generated from a JavaScript 1841 implementation, its precision is already limited to 53 bits 1842 maximum.) 1844 * Numbers with fractional parts are represented as floating-point 1845 values, performing the decimal-to-binary conversion based on the 1846 precision provided by IEEE 754 binary64. Then, when encoding in 1847 CBOR, the preferred serialization uses the shortest floating-point 1848 representation exactly representing this conversion result; for 1849 instance, 1.5 is represented in a 16-bit floating-point value (not 1850 all implementations will be capable of efficiently finding the 1851 minimum form, though). Instead of using the default binary64 1852 precision, there may be an implementation-defined limit to the 1853 precision of the conversion that will affect the precision of the 1854 represented values. Decimal representation should only be used on 1855 the CBOR side if that is specified in a protocol. 1857 CBOR has been designed to generally provide a more compact encoding 1858 than JSON. 
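The number handling in the suggested conversion above can be sketched in Python as follows. The sketch relies on the standard json module, which parses a number token without a fraction or exponent as an int and everything else as a float; treating that split as the "with or without fractional parts" distinction is an assumption of this example, as are the names classify_json_number, INT_MIN, and INT_MAX.

```python
import json

# Default range represented as CBOR integers: -2**53+1 .. 2**53-1.
INT_MIN = -(2**53) + 1
INT_MAX = 2**53 - 1


def classify_json_number(token: str):
    """Return ('int', n) or ('float', x) for a single JSON number token."""
    value = json.loads(token)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("not a JSON number")
    if isinstance(value, int) and INT_MIN <= value <= INT_MAX:
        return ("int", value)
    # Integers outside the threshold and all other numbers become floats.
    return ("float", float(value))


print(classify_json_number("42"))   # ('int', 42)
print(classify_json_number("1.5"))  # ('float', 1.5)
```

A protocol or converter that chooses one of the other ranges mentioned above would only need to change the two threshold constants.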
One implementation strategy that might come to mind is to 1859 perform a JSON-to-CBOR encoding in place in a single buffer. This 1860 strategy would need to carefully consider a number of pathological 1861 cases; for example, some strings represented with no or very few 1862 escapes and longer (or much longer) than 255 bytes may expand when 1863 encoded as UTF-8 strings in CBOR. Similarly, a few of the binary 1864 floating-point representations might cause expansion from some short 1865 decimal representations (1.1, 1e9) in JSON. This may be hard to get 1866 right, and any ensuing vulnerabilities may be exploited by an 1867 attacker. 1869 7. Future Evolution of CBOR 1871 Successful protocols evolve over time. New ideas appear, 1872 implementation platforms improve, related protocols are developed and 1873 evolve, and new requirements from applications and protocols are 1874 added. Facilitating protocol evolution is therefore an important 1875 design consideration for any protocol development. 1877 For protocols that will use CBOR, CBOR provides some useful 1878 mechanisms to facilitate their evolution. Best practices for this 1879 are well known, particularly from the development of JSON- 1880 based protocols. Therefore, such best practices are outside the 1881 scope of this specification. 1883 However, facilitating the evolution of CBOR itself is very well 1884 within its scope. CBOR is designed both to provide a stable basis 1885 for development of CBOR-based protocols and to be able to evolve. 1886 Since a successful protocol may live for decades, CBOR needs to be 1887 designed for decades of use and evolution. This section provides 1888 some guidance for the evolution of CBOR. It is necessarily more 1889 subjective than other parts of this document. It is also necessarily 1890 incomplete, lest it turn into a textbook on protocol development. 1892 7.1.
Extension Points 1894 In a protocol design, opportunities for evolution are often included 1895 in the form of extension points. For example, there may be a 1896 codepoint space that is not fully allocated from the outset, and the 1897 protocol is designed to tolerate and embrace implementations that 1898 start using more codepoints than initially allocated. 1900 Sizing the codepoint space may be difficult because the range 1901 required may be hard to predict. An attempt should be made to make 1902 the codepoint space large enough so that it can slowly be filled over 1903 the intended lifetime of the protocol. 1905 CBOR has three major extension points: 1907 * the "simple" space (values in major type 7). Of the 24 efficient 1908 (and 224 slightly less efficient) values, only a small number have 1909 been allocated. Implementations receiving an unknown simple data 1910 item may be able to process it as such, given that the structure 1911 of the value is indeed simple. The IANA registry in Section 9.1 1912 is the appropriate way to address the extensibility of this 1913 codepoint space. 1915 * the "tag" space (values in major type 6). Again, only a small 1916 part of the codepoint space has been allocated, and the space is 1917 abundant (although the early numbers are more efficient than the 1918 later ones). Implementations receiving an unknown tag number can 1919 choose to simply ignore it or to process it as an unknown tag 1920 number wrapping the enclosed data item. The IANA registry in 1921 Section 9.2 is the appropriate way to address the extensibility of 1922 this codepoint space. 1924 * the "additional information" space. An implementation receiving 1925 an unknown additional information value has no way to continue 1926 decoding, so allocating codepoints to this space is a major step. 1927 There are also very few codepoints left. 1929 7.2. 
Curating the Additional Information Space 1931 The human mind is sometimes drawn to filling in little perceived gaps 1932 to make something neat. We expect the remaining gaps in the 1933 codepoint space for the additional information values to be an 1934 attractor for new ideas, just because they are there. 1936 The present specification does not manage the additional information 1937 codepoint space by an IANA registry. Instead, allocations out of 1938 this space can only be done by updating this specification. 1940 For an additional information value of n >= 24, the size of the 1941 additional data typically is 2**(n-24) bytes. Therefore, additional 1942 information values 28 and 29 should be viewed as candidates for 1943 128-bit and 256-bit quantities, in case a need arises to add them to 1944 the protocol. Additional information value 30 is then the only 1945 additional information value available for general allocation, and 1946 there should be a very good reason for allocating it before assigning 1947 it through an update of this protocol. 1949 8. Diagnostic Notation 1951 CBOR is a binary interchange format. To facilitate documentation and 1952 debugging, and in particular to facilitate communication between 1953 entities cooperating in debugging, this section defines a simple 1954 human-readable diagnostic notation. All actual interchange always 1955 happens in the binary format. 1957 Note that this truly is a diagnostic format; it is not meant to be 1958 parsed. Therefore, no formal definition (as in ABNF) is given in 1959 this document. (Implementers looking for a text-based format for 1960 representing CBOR data items in configuration files may also want to 1961 consider YAML [YAML].) 1963 The diagnostic notation is loosely based on JSON as it is defined in 1964 RFC 8259, extending it where needed. 
1966 The notation borrows the JSON syntax for numbers (integer and 1967 floating point), True (>true<), False (>false<), Null (>null<), UTF-8 1968 strings, arrays, and maps (maps are called objects in JSON; the 1969 diagnostic notation extends JSON here by allowing any data item in 1970 the key position). Undefined is written >undefined< as in 1971 JavaScript. The non-finite floating-point numbers Infinity, 1972 -Infinity, and NaN are written exactly as in this sentence (this is 1973 also a way they can be written in JavaScript, although JSON does not 1974 allow them). A tag is written as an integer number for the tag 1975 number, followed by the tag content in parentheses; for instance, an 1976 RFC 3339 (ISO 8601) date could be notated as: 1978 0("2013-03-21T20:04:00Z") 1980 or the equivalent relative time as 1982 1(1363896240) 1984 Byte strings are notated in one of the base encodings, without 1985 padding, enclosed in single quotes, prefixed by >h< for base16, >b32< 1986 for base32, >h32< for base32hex, >b64< for base64 or base64url (the 1987 actual encodings do not overlap, so the string remains unambiguous). 1988 For example, the byte string 0x12345678 could be written h'12345678', 1989 b32'CI2FM6A', or b64'EjRWeA'. 1991 Unassigned simple values are given as "simple()" with the appropriate 1992 integer in the parentheses. For example, "simple(42)" indicates 1993 major type 7, value 42. 1995 8.1. Encoding Indicators 1997 Sometimes it is useful to indicate in the diagnostic notation which 1998 of several alternative representations were actually used; for 1999 example, a data item written >1.5< by a diagnostic decoder might have 2000 been encoded as a half-, single-, or double-precision float. 2002 The convention for encoding indicators is that anything starting with 2003 an underscore and all following characters that are alphanumeric or 2004 underscore, is an encoding indicator, and can be ignored by anyone 2005 not interested in this information. 
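A reader who is not interested in encoding indicators could drop them with a small filter such as the Python sketch below. It is only an illustration: the diagnostic notation has no formal grammar, so the regular expression is an assumption about typical input. The first alternative matches quoted strings so that they are copied through unchanged; the second matches an underscore together with any alphanumeric or underscore characters that follow. The function name strip_encoding_indicators is hypothetical.

```python
import re

# Group 1: a double- or single-quoted string, to be kept verbatim.
# Otherwise: an encoding indicator, i.e. "_" plus following [A-Za-z0-9_].
_TOKEN = re.compile(r'''("(?:[^"\\]|\\.)*"|'[^']*')|_[A-Za-z0-9_]*''')


def strip_encoding_indicators(text: str) -> str:
    """Remove encoding indicators from a diagnostic notation string."""
    return _TOKEN.sub(lambda match: match.group(1) or "", text)


print(strip_encoding_indicators("1.5_1"))     # 1.5
print(strip_encoding_indicators("[_ 1, 2]"))  # [ 1, 2]
```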
   Encoding indicators are always optional.

   A single underscore can be written after the opening brace of a map
   or the opening bracket of an array to indicate that the data item was
   represented in indefinite-length format. For example, [_ 1, 2]
   contains an indicator that an indefinite-length representation was
   used to represent the data item [1, 2].

   An underscore followed by a decimal digit n indicates that the
   preceding item (or, for arrays and maps, the item starting with the
   preceding bracket or brace) was encoded with an additional
   information value of 24+n. For example, 1.5_1 is a half-precision
   floating-point number, while 1.5_3 is encoded as double precision.
   This encoding indicator is not shown in Appendix A. (Note that the
   encoding indicator "_" is thus an abbreviation of the full form "_7",
   which is not used.)

   As a special case, byte and text strings of indefinite length can be
   notated in the form (_ h'0123', h'4567') and (_ "foo", "bar").

9. IANA Considerations

   IANA has created two registries for new CBOR values. The registries
   are separate, that is, not under an umbrella registry, and follow the
   rules in [RFC8126]. IANA has also assigned a new MIME media type and
   an associated Constrained Application Protocol (CoAP) Content-Format
   entry.

   [To be removed by RFC editor:] IANA is requested to update these
   registries to point to the present document instead of RFC 7049.

9.1. Simple Values Registry

   IANA has created the "Concise Binary Object Representation (CBOR)
   Simple Values" registry at [IANA.cbor-simple-values]. The initial
   values are shown in Table 3.

   New entries in the range 0 to 19 are assigned by Standards Action.
   It is suggested that these Standards Actions allocate values starting
   with the number 16 in order to reserve the lower numbers for
   contiguous blocks (if any).
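
   The allocation ranges of this registry can be summarized as a small
   lookup. The Python sketch below is illustrative only: the function
   name simple_value_policy is hypothetical, and it combines the 0 to 19
   range above with the 32 to 255 range given in the next paragraph. It
   returns None for 20 to 31 because this section states no registration
   policy for those values (20 to 23 appear in Appendix B as False,
   True, Null, and Undefined).

```python
from typing import Optional


def simple_value_policy(value: int) -> Optional[str]:
    """Return the registration policy named in Section 9.1 for a simple value.

    Hypothetical helper, not part of any CBOR library. None means that
    Section 9.1 names no policy for this value (the range 20..31).
    """
    if not 0 <= value <= 255:
        raise ValueError("simple values range from 0 to 255")
    if value <= 19:
        return "Standards Action"
    if value >= 32:
        return "Specification Required"
    return None  # 20..31: no policy stated in this section


if __name__ == "__main__":
    for v in (16, 19, 20, 31, 32, 255):
        print(v, simple_value_policy(v))
```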

   New entries in the range 32 to 255 are assigned by Specification
   Required.

9.2. Tags Registry

   IANA has created the "Concise Binary Object Representation (CBOR)
   Tags" registry at [IANA.cbor-tags]. The tags that were defined in
   [RFC7049] are described in detail in Section 3.4, but other tags have
   already been defined.

   New entries in the range 0 to 23 are assigned by Standards Action.
   New entries in the range 24 to 255 are assigned by Specification
   Required. New entries in the range 256 to 18446744073709551615 are
   assigned by First Come First Served. The template for registration
   requests is:

   * Data item

   * Semantics (short form)

   In addition, First Come First Served requests should include:

   * Point of contact

   * Description of semantics (URL) - This description is optional; the
     URL can point to something like an Internet-Draft or a web page.

9.3. Media Type ("MIME Type")

   The Internet media type [RFC6838] for a single encoded CBOR data item
   is application/cbor.

   Type name: application

   Subtype name: cbor

   Required parameters: n/a

   Optional parameters: n/a

   Encoding considerations: binary

   Security considerations: See Section 10 of this document

   Interoperability considerations: n/a

   Published specification: This document

   Applications that use this media type: None yet, but it is expected
   that this format will be deployed in protocols and applications.

   Additional information:
      Magic number(s): n/a
      File extension(s): .cbor
      Macintosh file type code(s): n/a

   Person & email address to contact for further information:
      Carsten Bormann
      cabo@tzi.org

   Intended usage: COMMON

   Restrictions on usage: none

   Author:
      Carsten Bormann

   Change controller:
      The IESG

9.4. CoAP Content-Format

   Media Type: application/cbor

   Encoding: -
   Id: 60

   Reference: [RFCthis]

9.5. The +cbor Structured Syntax Suffix Registration

   Name: Concise Binary Object Representation (CBOR)

   +suffix: +cbor

   References: [RFCthis]

   Encoding Considerations: CBOR is a binary format.

   Interoperability Considerations: n/a

   Fragment Identifier Considerations:
      The syntax and semantics of fragment identifiers specified for
      +cbor SHOULD be as specified for "application/cbor". (At
      publication of this document, there is no fragment identification
      syntax defined for "application/cbor".)

      The syntax and semantics for fragment identifiers for a specific
      "xxx/yyy+cbor" SHOULD be processed as follows:

      For cases defined in +cbor, where the fragment identifier resolves
      per the +cbor rules, then process as specified in +cbor.

      For cases defined in +cbor, where the fragment identifier does
      not resolve per the +cbor rules, then process as specified in
      "xxx/yyy+cbor".

      For cases not defined in +cbor, then process as specified in
      "xxx/yyy+cbor".

   Security Considerations: See Section 10 of this document

   Contact:
      Apps Area Working Group (apps-discuss@ietf.org)

   Author/Change Controller:
      The Apps Area Working Group.
      The IESG has change control over this registration.

10. Security Considerations

   A network-facing application can exhibit vulnerabilities in its
   processing logic for incoming data. Complex parsers are well known
   as a likely source of such vulnerabilities, such as the ability to
   remotely crash a node, or even remotely execute arbitrary code on it.
   CBOR attempts to narrow the opportunities for introducing such
   vulnerabilities by reducing parser complexity, by giving the entire
   range of encodable values a meaning where possible.

   Because CBOR decoders are often used as a first step in processing
   unvalidated input, they need to be fully prepared for all types of
   hostile input that may be designed to corrupt, overrun, or achieve
   control of the system decoding the CBOR data item. A CBOR decoder
   needs to assume that all input may be hostile even if it has been
   checked by a firewall, has come over a secure channel such as TLS, is
   encrypted or signed, or has come from some other source that is
   presumed trusted.

   Hostile input may be constructed to overrun buffers, overflow or
   underflow integer arithmetic, or cause other decoding disruption.
   CBOR data items might have lengths or sizes that are intentionally
   extremely large or too short. Resource exhaustion attacks might
   attempt to lure a decoder into allocating very big data items
   (strings, arrays, maps, or even arbitrary precision numbers) or
   exhaust the stack depth by setting up deeply nested items. Decoders
   need to have appropriate resource management to mitigate these
   attacks. (Items for which very large sizes are given can also
   attempt to exploit integer overflow vulnerabilities.)

   A CBOR decoder, by definition, only accepts well-formed CBOR; this is
   the first step to its robustness. Input that is not well-formed CBOR
   causes no further processing from the point where the lack of
   well-formedness was detected. If possible, any data decoded up to this
   point should have no impact on the application using the CBOR
   decoder.

   In addition to ascertaining well-formedness, a CBOR decoder might
   also perform validity checks on the CBOR data. Alternatively, it can
   leave those checks to the application using the decoder. This choice
   needs to be clearly documented in the decoder.
   Beyond the validity at the CBOR level, an application also needs to
   ascertain that the input is in alignment with the application
   protocol that is serialized in CBOR.

   The input check itself may consume resources. This is usually linear
   in the size of the input, which means that an attacker has to spend
   resources that are commensurate to the resources spent by the
   defender on input validation. Processing for arbitrary-precision
   numbers may exceed linear effort. Also, some hash-table
   implementations that are used by decoders to build in-memory
   representations of maps can be attacked to spend quadratic effort,
   unless a secret key is employed (see Section 7 of [SIPHASH]). Such
   superlinear efforts can be employed by an attacker to exhaust
   resources at or before the input validator; they therefore need to be
   avoided in a CBOR decoder implementation. Note that tag number
   definitions and their implementations can add security considerations
   of this kind; this should then be discussed in the security
   considerations of the tag number definition.

   CBOR encoders do not receive input directly from the network and are
   thus not directly attackable in the same way as CBOR decoders.
   However, CBOR encoders often have an API that takes input from
   another level in the implementation and can be attacked through that
   API. The design and implementation of that API should assume the
   behavior of its caller may be based on hostile input or on coding
   mistakes. It should check inputs for buffer overruns, overflow and
   underflow of integer arithmetic, and other such errors that are aimed
   to disrupt the encoder.

   Protocols should be defined in such a way that potential multiple
   interpretations are reliably reduced to a single interpretation.
   For example, an attacker could make use of invalid input such as
   duplicate keys in maps, or exploit different precision in processing
   numbers to make one application base its decisions on a different
   interpretation than the one that will be used by a second
   application. To facilitate consistent interpretation, encoder and
   decoder implementations should provide a validity checking mode of
   operation (Section 5.4). Note, however, that a generic decoder
   cannot know about all requirements that an application poses on its
   input data; it is therefore not relieving the application from
   performing its own input checking. Also, since the set of defined
   tag numbers evolves, the application may employ a tag number that is
   not yet supported for validity checking by the generic decoder it
   uses. Generic decoders therefore need to document which tag numbers
   they support and what validity checking they can provide for each of
   them as well as for basic CBOR validity (UTF-8 checking, duplicate
   map key checking).

11. References

11.1. Normative References

   [ECMA262]  Ecma International, "ECMAScript 2018 Language
              Specification", ECMA Standard ECMA-262, 9th Edition, June
              2018, .

   [IEEE754]  IEEE, "IEEE Standard for Floating-Point Arithmetic", IEEE
              Std 754-2008.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996, .

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997, .

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002, .

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, .

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005, .

   [RFC4287]  Nottingham, M., Ed. and R. Sayre, Ed., "The Atom
              Syndication Format", RFC 4287, DOI 10.17487/RFC4287,
              December 2005, .

   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006, .

   [RFC8126]  Cotton, M., Leiba, B., and T. Narten, "Guidelines for
              Writing an IANA Considerations Section in RFCs", BCP 26,
              RFC 8126, DOI 10.17487/RFC8126, June 2017, .

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, .

   [TIME_T]   The Open Group Base Specifications, "Vol. 1: Base
              Definitions, Issue 7", 2013 Edition, IEEE Std 1003.1,
              Section 4.15 'Seconds Since the Epoch', 2013, .

11.2. Informative References

   [ASN.1]    International Telecommunication Union, "Information
              Technology -- ASN.1 encoding rules: Specification of Basic
              Encoding Rules (BER), Canonical Encoding Rules (CER) and
              Distinguished Encoding Rules (DER)", ITU-T Recommendation
              X.690, 1994.

   [BSON]     Various, "BSON - Binary JSON", 2013, .

   [I-D.ietf-cbor-sequence]
              Bormann, C., "Concise Binary Object Representation (CBOR)
              Sequences", Work in Progress, Internet-Draft, draft-ietf-
              cbor-sequence-02, 25 September 2019, .

   [IANA.cbor-simple-values]
              IANA, "Concise Binary Object Representation (CBOR) Simple
              Values", .

   [IANA.cbor-tags]
              IANA, "Concise Binary Object Representation (CBOR) Tags", .

   [MessagePack]
              Furuhashi, S., "MessagePack", 2013, .

   [PCRE]     Ho, A., "PCRE - Perl Compatible Regular Expressions",
              2018, .

   [RFC0713]  Haverty, J., "MSDTP-Message Services Data Transmission
              Protocol", RFC 713, DOI 10.17487/RFC0713, April 1976, .

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, DOI 10.17487/RFC6838, January 2013, .

   [RFC7049]  Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049,
              October 2013, .

   [RFC7228]  Bormann, C., Ersue, M., and A. Keranen, "Terminology for
              Constrained-Node Networks", RFC 7228,
              DOI 10.17487/RFC7228, May 2014, .

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015, .

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017, .

   [RFC8618]  Dickinson, J., Hague, J., Dickinson, S., Manderson, T.,
              and J. Bond, "Compacted-DNS (C-DNS): A Format for DNS
              Packet Capture", RFC 8618, DOI 10.17487/RFC8618, September
              2019, .

   [SIPHASH]  Aumasson, J. and D. Bernstein, "SipHash: A Fast Short-
              Input PRF", DOI 10.1007/978-3-642-34931-7_28, Lecture
              Notes in Computer Science pp. 489-508, 2012, .

   [YAML]     Ben-Kiki, O., Evans, C., and I.d. Net, "YAML Ain't Markup
              Language (YAML[TM]) Version 1.2", 3rd Edition, October
              2009, .

Appendix A. Examples

   The following table provides some CBOR-encoded values in hexadecimal
   (right column), together with diagnostic notation for these values
   (left column). Note that the string "\u00fc" is one form of
   diagnostic notation for a UTF-8 string containing the single Unicode
   character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (u umlaut).
   Similarly, "\u6c34" is a UTF-8 string in diagnostic notation with a
   single character U+6C34 (CJK UNIFIED IDEOGRAPH-6C34, often
   representing "water"), and "\ud800\udd51" is a UTF-8 string in
   diagnostic notation with a single character U+10151 (GREEK ACROPHONIC
   ATTIC FIFTY STATERS). (Note that all these single-character strings
   could also be represented in native UTF-8 in diagnostic notation,
   just not in an ASCII-only specification like the present one.) In
   the diagnostic notation provided for bignums, their intended numeric
   value is shown as a decimal number (such as 18446744073709551616)
   instead of showing a tagged byte string (such as
   2(h'010000000000000000')).

   +------------------------------+------------------------------------+
   | Diagnostic                   | Encoded                            |
   +==============================+====================================+
   | 0                            | 0x00                               |
   +------------------------------+------------------------------------+
   | 1                            | 0x01                               |
   +------------------------------+------------------------------------+
   | 10                           | 0x0a                               |
   +------------------------------+------------------------------------+
   | 23                           | 0x17                               |
   +------------------------------+------------------------------------+
   | 24                           | 0x1818                             |
   +------------------------------+------------------------------------+
   | 25                           | 0x1819                             |
   +------------------------------+------------------------------------+
   | 100                          | 0x1864                             |
   +------------------------------+------------------------------------+
   | 1000                         | 0x1903e8                           |
   +------------------------------+------------------------------------+
   | 1000000                      | 0x1a000f4240                       |
   +------------------------------+------------------------------------+
   | 1000000000000                | 0x1b000000e8d4a51000               |
   +------------------------------+------------------------------------+
   | 18446744073709551615         | 0x1bffffffffffffffff               |
   +------------------------------+------------------------------------+
   | 18446744073709551616         | 0xc249010000000000000000           |
   +------------------------------+------------------------------------+
   | -18446744073709551616        | 0x3bffffffffffffffff               |
   +------------------------------+------------------------------------+
   | -18446744073709551617        | 0xc349010000000000000000           |
   +------------------------------+------------------------------------+
   | -1                           | 0x20                               |
   +------------------------------+------------------------------------+
   | -10                          | 0x29                               |
   +------------------------------+------------------------------------+
   | -100                         | 0x3863                             |
   +------------------------------+------------------------------------+
   | -1000                        | 0x3903e7                           |
   +------------------------------+------------------------------------+
   | 0.0                          | 0xf90000                           |
   +------------------------------+------------------------------------+
   | -0.0                         | 0xf98000                           |
   +------------------------------+------------------------------------+
   | 1.0                          | 0xf93c00                           |
   +------------------------------+------------------------------------+
   | 1.1                          | 0xfb3ff199999999999a               |
   +------------------------------+------------------------------------+
   | 1.5                          | 0xf93e00                           |
   +------------------------------+------------------------------------+
   | 65504.0                      | 0xf97bff                           |
   +------------------------------+------------------------------------+
   | 100000.0                     | 0xfa47c35000                       |
   +------------------------------+------------------------------------+
   | 3.4028234663852886e+38       | 0xfa7f7fffff                       |
   +------------------------------+------------------------------------+
   | 1.0e+300                     | 0xfb7e37e43c8800759c               |
   +------------------------------+------------------------------------+
   | 5.960464477539063e-8         | 0xf90001                           |
   +------------------------------+------------------------------------+
   | 0.00006103515625             | 0xf90400                           |
   +------------------------------+------------------------------------+
   | -4.0                         | 0xf9c400                           |
   +------------------------------+------------------------------------+
   | -4.1                         | 0xfbc010666666666666               |
   +------------------------------+------------------------------------+
   | Infinity                     | 0xf97c00                           |
   +------------------------------+------------------------------------+
   | NaN                          | 0xf97e00                           |
   +------------------------------+------------------------------------+
   | -Infinity                    | 0xf9fc00                           |
   +------------------------------+------------------------------------+
   | Infinity                     | 0xfa7f800000                       |
   +------------------------------+------------------------------------+
   | NaN                          | 0xfa7fc00000                       |
   +------------------------------+------------------------------------+
   | -Infinity                    | 0xfaff800000                       |
   +------------------------------+------------------------------------+
   | Infinity                     | 0xfb7ff0000000000000               |
   +------------------------------+------------------------------------+
   | NaN                          | 0xfb7ff8000000000000               |
   +------------------------------+------------------------------------+
   | -Infinity                    | 0xfbfff0000000000000               |
   +------------------------------+------------------------------------+
   | false                        | 0xf4                               |
   +------------------------------+------------------------------------+
   | true                         | 0xf5                               |
   +------------------------------+------------------------------------+
   | null                         | 0xf6                               |
   +------------------------------+------------------------------------+
   | undefined                    | 0xf7                               |
   +------------------------------+------------------------------------+
   | simple(16)                   | 0xf0                               |
   +------------------------------+------------------------------------+
   | simple(255)                  | 0xf8ff                             |
   +------------------------------+------------------------------------+
   | 0("2013-03-21T20:04:00Z")    | 0xc074323031332d30332d32315432303a |
   |                              | 30343a30305a                       |
   +------------------------------+------------------------------------+
   | 1(1363896240)                | 0xc11a514b67b0                     |
   +------------------------------+------------------------------------+
   | 1(1363896240.5)              | 0xc1fb41d452d9ec200000             |
   +------------------------------+------------------------------------+
   | 23(h'01020304')              | 0xd74401020304                     |
   +------------------------------+------------------------------------+
   | 24(h'6449455446')            | 0xd818456449455446                 |
   +------------------------------+------------------------------------+
   | 32("http://www.example.com") | 0xd82076687474703a2f2f7777772e6578 |
   |                              | 616d706c652e636f6d                 |
   +------------------------------+------------------------------------+
   | h''                          | 0x40                               |
   +------------------------------+------------------------------------+
   | h'01020304'                  | 0x4401020304                       |
   +------------------------------+------------------------------------+
   | ""                           | 0x60                               |
   +------------------------------+------------------------------------+
   | "a"                          | 0x6161                             |
   +------------------------------+------------------------------------+
   | "IETF"                       | 0x6449455446                       |
   +------------------------------+------------------------------------+
   | "\"\\"                       | 0x62225c                           |
   +------------------------------+------------------------------------+
   | "\u00fc"                     | 0x62c3bc                           |
   +------------------------------+------------------------------------+
   | "\u6c34"                     | 0x63e6b0b4                         |
   +------------------------------+------------------------------------+
   | "\ud800\udd51"               | 0x64f0908591                       |
   +------------------------------+------------------------------------+
   | []                           | 0x80                               |
   +------------------------------+------------------------------------+
   | [1, 2, 3]                    | 0x83010203                         |
   +------------------------------+------------------------------------+
   | [1, [2, 3], [4, 5]]          | 0x8301820203820405                 |
   +------------------------------+------------------------------------+
   | [1, 2, 3, 4, 5, 6, 7, 8, 9,  | 0x98190102030405060708090a0b0c0d0e |
   | 10, 11, 12, 13, 14, 15, 16,  | 0f101112131415161718181819         |
   | 17, 18, 19, 20, 21, 22, 23,  |                                    |
   | 24, 25]                      |                                    |
   +------------------------------+------------------------------------+
   | {}                           | 0xa0                               |
   +------------------------------+------------------------------------+
   | {1: 2, 3: 4}                 | 0xa201020304                       |
   +------------------------------+------------------------------------+
   | {"a": 1, "b": [2, 3]}        | 0xa26161016162820203               |
   +------------------------------+------------------------------------+
   | ["a", {"b": "c"}]            | 0x826161a161626163                 |
   +------------------------------+------------------------------------+
   |{"a": "A", "b": "B", "c": "C",| 0xa5616161416162614261636143616461 |
   | "d": "D", "e": "E"}          | 4461656145                         |
   +------------------------------+------------------------------------+
   | (_ h'0102', h'030405')       | 0x5f42010243030405ff               |
   +------------------------------+------------------------------------+
   | (_ "strea", "ming")          | 0x7f657374726561646d696e67ff       |
   +------------------------------+------------------------------------+
   | [_ ]                         | 0x9fff                             |
   +------------------------------+------------------------------------+
   | [_ 1, [2, 3], [_ 4, 5]]      | 0x9f018202039f0405ffff             |
   +------------------------------+------------------------------------+
   | [_ 1, [2, 3], [4, 5]]        | 0x9f01820203820405ff               |
   +------------------------------+------------------------------------+
   | [1, [2, 3], [_ 4, 5]]        | 0x83018202039f0405ff               |
   +------------------------------+------------------------------------+
   | [1, [_ 2, 3], [4, 5]]        | 0x83019f0203ff820405               |
   +------------------------------+------------------------------------+
   |[_ 1, 2, 3, 4, 5, 6, 7, 8, 9, | 0x9f0102030405060708090a0b0c0d0e0f |
   | 10, 11, 12, 13, 14, 15, 16,  | 101112131415161718181819ff         |
   | 17, 18, 19, 20, 21, 22, 23,  |                                    |
   | 24, 25]                      |                                    |
   +------------------------------+------------------------------------+
   | {_ "a": 1, "b": [_ 2, 3]}    | 0xbf61610161629f0203ffff           |
   +------------------------------+------------------------------------+
   | ["a", {_ "b": "c"}]          | 0x826161bf61626163ff               |
   +------------------------------+------------------------------------+
   | {_ "Fun": true, "Amt": -2}   | 0xbf6346756ef563416d7421ff         |
   +------------------------------+------------------------------------+

               Table 5: Examples of Encoded CBOR Data Items

Appendix B. Jump Table

   For brevity, this jump table does not show initial bytes that are
   reserved for future extension. It also only shows a selection of the
   initial bytes that can be used for optional features. (All unsigned
   integers are in network byte order.)

   +------------+------------------------------------------------+
   | Byte       | Structure/Semantics                            |
   +============+================================================+
   | 0x00..0x17 | Unsigned integer 0x00..0x17 (0..23)            |
   +------------+------------------------------------------------+
   | 0x18       | Unsigned integer (one-byte uint8_t follows)    |
   +------------+------------------------------------------------+
   | 0x19       | Unsigned integer (two-byte uint16_t follows)   |
   +------------+------------------------------------------------+
   | 0x1a       | Unsigned integer (four-byte uint32_t follows)  |
   +------------+------------------------------------------------+
   | 0x1b       | Unsigned integer (eight-byte uint64_t follows) |
   +------------+------------------------------------------------+
   | 0x20..0x37 | Negative integer -1-0x00..-1-0x17 (-1..-24)    |
   +------------+------------------------------------------------+
   | 0x38       | Negative integer -1-n (one-byte uint8_t for n  |
   |            | follows)                                       |
   +------------+------------------------------------------------+
   | 0x39       | Negative integer -1-n (two-byte uint16_t for n |
   |            | follows)                                       |
   +------------+------------------------------------------------+
   | 0x3a       | Negative integer -1-n (four-byte uint32_t for  |
   |            | n follows)                                     |
   +------------+------------------------------------------------+
   | 0x3b       | Negative integer -1-n (eight-byte uint64_t for |
   |            | n follows)                                     |
   +------------+------------------------------------------------+
   | 0x40..0x57 | byte string (0x00..0x17 bytes follow)          |
   +------------+------------------------------------------------+
   | 0x58       | byte string (one-byte uint8_t for n, and then  |
   |            | n bytes follow)                                |
   +------------+------------------------------------------------+
   | 0x59       | byte string (two-byte uint16_t for n, and then |
   |            | n bytes follow)                                |
   +------------+------------------------------------------------+
   | 0x5a       | byte string (four-byte uint32_t for n, and     |
   |            | then n bytes follow)                           |
   +------------+------------------------------------------------+
   | 0x5b       | byte string (eight-byte uint64_t for n, and    |
   |            | then n bytes follow)                           |
   +------------+------------------------------------------------+
   | 0x5f       | byte string, byte strings follow, terminated   |
   |            | by "break"                                     |
   +------------+------------------------------------------------+
   | 0x60..0x77 | UTF-8 string (0x00..0x17 bytes follow)         |
   +------------+------------------------------------------------+
   | 0x78       | UTF-8 string (one-byte uint8_t for n, and then |
   |            | n bytes follow)                                |
   +------------+------------------------------------------------+
   | 0x79       | UTF-8 string (two-byte uint16_t for n, and     |
   |            | then n bytes follow)                           |
   +------------+------------------------------------------------+
   | 0x7a       | UTF-8 string (four-byte uint32_t for n, and    |
   |            | then n bytes follow)                           |
   +------------+------------------------------------------------+
   | 0x7b       | UTF-8 string (eight-byte uint64_t for n, and   |
   |            | then n bytes follow)                           |
   +------------+------------------------------------------------+
   | 0x7f       | UTF-8 string, UTF-8 strings follow, terminated |
   |            | by "break"                                     |
   +------------+------------------------------------------------+
   | 0x80..0x97 | array (0x00..0x17 data items follow)           |
   +------------+------------------------------------------------+
   | 0x98       | array (one-byte uint8_t for n, and then n data |
   |            | items follow)                                  |
   +------------+------------------------------------------------+
   | 0x99       | array (two-byte uint16_t for n, and then n     |
   |            | data items follow)                             |
   +------------+------------------------------------------------+
   | 0x9a       | array (four-byte uint32_t for n, and then n    |
   |            | data items follow)                             |
   +------------+------------------------------------------------+
   | 0x9b       | array (eight-byte uint64_t for n, and then n   |
   |            | data items follow)                             |
   +------------+------------------------------------------------+
   | 0x9f       | array, data items follow, terminated by        |
   |            | "break"                                        |
   +------------+------------------------------------------------+
   | 0xa0..0xb7 | map (0x00..0x17 pairs of data items follow)    |
   +------------+------------------------------------------------+
   | 0xb8       | map (one-byte uint8_t for n, and then n pairs  |
   |            | of data items follow)                          |
   +------------+------------------------------------------------+
   | 0xb9       | map (two-byte uint16_t for n, and then n pairs |
   |            | of data items follow)                          |
   +------------+------------------------------------------------+
   | 0xba       | map (four-byte uint32_t for n, and then n      |
   |            | pairs of data items follow)                    |
   +------------+------------------------------------------------+
   | 0xbb       | map (eight-byte uint64_t for n, and then n     |
   |            | pairs of data items follow)                    |
   +------------+------------------------------------------------+
   | 0xbf       | map, pairs of data items follow, terminated by |
   |            | "break"                                        |
   +------------+------------------------------------------------+
   | 0xc0       | Text-based date/time (data item follows; see   |
   |            | Section 3.4.1)                                 |
   +------------+------------------------------------------------+
   | 0xc1       | Epoch-based date/time (data item follows; see  |
   |            | Section 3.4.2)                                 |
   +------------+------------------------------------------------+
   | 0xc2       | Positive bignum (data item "byte string"       |
   |            | follows)                                       |
   +------------+------------------------------------------------+
   | 0xc3       | Negative bignum (data item "byte string"       |
   |            | follows)                                       |
   +------------+------------------------------------------------+
   | 0xc4       | Decimal Fraction (data item "array" follows;   |
   |            | see Section 3.4.4)                             |
   +------------+------------------------------------------------+
   | 0xc5       | Bigfloat (data item "array" follows; see       |
   |            | Section 3.4.4)                                 |
   +------------+------------------------------------------------+
   | 0xc6..0xd4 | (tag)                                          |
   +------------+------------------------------------------------+
   | 0xd5..0xd7 | Expected Conversion (data item follows; see    |
   |            | Section 3.4.5.2)                               |
   +------------+------------------------------------------------+
   | 0xd8..0xdb | (more tags, 1/2/4/8 bytes and then a data item |
   |            | follow)                                        |
   +------------+------------------------------------------------+
   | 0xe0..0xf3 | (simple value)                                 |
   +------------+------------------------------------------------+
   | 0xf4       | False                                          |
   +------------+------------------------------------------------+
   | 0xf5       | True                                           |
   +------------+------------------------------------------------+
   | 0xf6       | Null                                           |
   +------------+------------------------------------------------+
   | 0xf7       | Undefined                                      |
   +------------+------------------------------------------------+
   | 0xf8       | (simple value, one byte follows)               |
+------------+------------------------------------------------+ 2724 | 0xf9 | Half-Precision Float (two-byte IEEE 754) | 2725 +------------+------------------------------------------------+ 2726 | 0xfa | Single-Precision Float (four-byte IEEE 754) | 2727 +------------+------------------------------------------------+ 2728 | 0xfb | Double-Precision Float (eight-byte IEEE 754) | 2729 +------------+------------------------------------------------+ 2730 | 0xff | "break" stop code | 2731 +------------+------------------------------------------------+ 2733 Table 6: Jump Table for Initial Byte 2735 Appendix C. Pseudocode 2737 The well-formedness of a CBOR item can be checked by the pseudocode 2738 in Figure 1. The data is well-formed if and only if: 2740 * the pseudocode does not "fail"; 2741 * after execution of the pseudocode, no bytes are left in the input 2742 (except in streaming applications) 2744 The pseudocode has the following prerequisites: 2746 * take(n) reads n bytes from the input data and returns them as a 2747 byte string. If n bytes are no longer available, take(n) fails. 2749 * uint() converts a byte string into an unsigned integer by 2750 interpreting the byte string in network byte order. 2752 * Arithmetic works as in C. 2754 * All variables are unsigned integers of sufficient range. 2756 Note that "well_formed" returns the major type for well-formed 2757 definite length items, but 0 for an indefinite length item (or -1 for 2758 a break stop code, only if "breakable" is set). This is used in 2759 "well_formed_indefinite" to ascertain that indefinite length strings 2760 only contain definite length strings as chunks. 
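The take() and uint() prerequisites listed above can be sketched in Python as follows.  This is an illustration only and not part of the specification: the class name ByteInput, the remaining() helper, and the use of EOFError to model a failing take(n) are assumptions of this sketch.

```python
class ByteInput:
    """Hypothetical input wrapper modelling the take(n) prerequisite."""

    def __init__(self, data):
        self._data = bytes(data)
        self._pos = 0

    def take(self, n):
        # Return the next n bytes, or fail if fewer than n bytes are left.
        if n > len(self._data) - self._pos:
            raise EOFError("n bytes are no longer available")
        chunk = self._data[self._pos:self._pos + n]
        self._pos += n
        return chunk

    def remaining(self):
        # Number of input bytes that have not been consumed yet.
        return len(self._data) - self._pos


def uint(byte_string):
    # Interpret the byte string in network byte order (big-endian).
    return int.from_bytes(byte_string, "big")
```

With the input 19 01 00, uint(take(1)) yields the initial byte 0x19 and uint(take(2)) yields the two-byte argument 256, after which no bytes remain.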
2762     well_formed (breakable = false) {
2763       // process initial bytes
2764       ib = uint(take(1));
2765       mt = ib >> 5;
2766       val = ai = ib & 0x1f;
2767       switch (ai) {
2768         case 24: val = uint(take(1)); break;
2769         case 25: val = uint(take(2)); break;
2770         case 26: val = uint(take(4)); break;
2771         case 27: val = uint(take(8)); break;
2772         case 28: case 29: case 30: fail();
2773         case 31:
2774           return well_formed_indefinite(mt, breakable);
2775       }
2776       // process content
2777       switch (mt) {
2778         // case 0, 1, 7 do not have content; just use val
2779         case 2: case 3: take(val); break; // bytes/UTF-8
2780         case 4: for (i = 0; i < val; i++) well_formed(); break;
2781         case 5: for (i = 0; i < val*2; i++) well_formed(); break;
2782         case 6: well_formed(); break;     // 1 embedded data item
2783         case 7: if (ai == 24 && val < 32) fail(); // bad simple
2784       }
2785       return mt;                    // finite data item
2786     }

2788     well_formed_indefinite(mt, breakable) {
2789       switch (mt) {
2790         case 2: case 3:
2791           while ((it = well_formed(true)) != -1)
2792             if (it != mt)           // need finite-length chunk
2793               fail();               //    of same type
2794           break;
2795         case 4: while (well_formed(true) != -1); break;
2796         case 5: while (well_formed(true) != -1) well_formed(); break;
2797         case 7:
2798           if (breakable)
2799             return -1;              // signal break out
2800           else fail();              // no enclosing indefinite
2801         default: fail();            // wrong mt
2802       }
2803       return 0;                     // no break out
2804     }

2806              Figure 1: Pseudocode for Well-Formedness Check

2808     Note that the remaining complexity of a complete CBOR decoder is
2809     about presenting data that has been decoded to the application in an
2810     appropriate form.

2812     Major types 0 and 1 are designed in such a way that they can be
2813     encoded in C from a signed integer without actually doing an if-then-
2814     else for positive/negative (Figure 2).  This uses the fact that
2815     (-1-n), the transformation for major type 1, is the same as ~n
2816     (bitwise complement) in C unsigned arithmetic; ~n can then be
2817     expressed as (-1)^n for the negative case, while 0^n leaves n
2818     unchanged for non-negative.  The sign of a number can be converted to
2819     -1 for negative and 0 for non-negative (0 or positive) by arithmetic-
2820     shifting the number by one bit less than the bit length of the number
2821     (for example, by 63 for 64-bit numbers).

2823     void encode_sint(int64_t n) {
2824       uint64_t ui = n >> 63;    // extend sign to whole length
2825       mt = ui & 0x20;           // extract major type
2826       ui ^= n;                  // complement negatives
2827       if (ui < 24)
2828         *p++ = mt + ui;
2829       else if (ui < 256) {
2830         *p++ = mt + 24;
2831         *p++ = ui;
2832       } else
2833            ...

2835            Figure 2: Pseudocode for Encoding a Signed Integer

2837  Appendix D.  Half-Precision

2839     As half-precision floating-point numbers were only added to IEEE 754
2840     in 2008 [IEEE754], today's programming platforms often still only
2841     have limited support for them.  It is very easy to include at least
2842     decoding support for them even without such support.  An example of a
2843     small decoder for half-precision floating-point numbers in the C
2844     language is shown in Figure 3.  A similar program for Python is in
2845     Figure 4; this code assumes that the 2-byte value has already been
2846     decoded as an (unsigned short) integer in network byte order (as
2847     would be done by the pseudocode in Appendix C).

2849     #include <math.h>

2851     double decode_half(unsigned char *halfp) {
2852       int half = (halfp[0] << 8) + halfp[1];
2853       int exp = (half >> 10) & 0x1f;
2854       int mant = half & 0x3ff;
2855       double val;
2856       if (exp == 0) val = ldexp(mant, -24);
2857       else if (exp != 31) val = ldexp(mant + 1024, exp - 25);
2858       else val = mant == 0 ? INFINITY : NAN;
2859       return half & 0x8000 ? -val : val;
2860     }

2862               Figure 3: C Code for a Half-Precision Decoder

2864     import struct
2865     from math import ldexp

2867     def decode_single(single):
2868         return struct.unpack("!f", struct.pack("!I", single))[0]

2870     def decode_half(half):
2871         valu = (half & 0x7fff) << 13 | (half & 0x8000) << 16
2872         if ((half & 0x7c00) != 0x7c00):
2873             return ldexp(decode_single(valu), 112)
2874         return decode_single(valu | 0x7f800000)

2876            Figure 4: Python Code for a Half-Precision Decoder

2878  Appendix E.  Comparison of Other Binary Formats to CBOR's Design
2879               Objectives

2881     The proposal for CBOR follows a history of binary formats that is as
2882     long as the history of computers themselves.  Different formats have
2883     had different objectives.  In most cases, the objectives of the
2884     format were never stated, although they can sometimes be implied by
2885     the context where the format was first used.  Some formats were meant
2886     to be universally usable, although history has proven that no binary
2887     format meets the needs of all protocols and applications.

2889     CBOR differs from many of these formats in that it starts with a set
2890     of objectives and attempts to meet just those.  This section
2891     compares a few of the dozens of formats with CBOR's objectives in
2892     order to help the reader decide if they want to use CBOR or a
2893     different format for a particular protocol or application.

2895     Note that the discussion here is not meant to be a criticism of any
2896     format: to the best of our knowledge, no format before CBOR was meant
2897     to cover CBOR's objectives in the priority we have assigned them.  A
2898     brief recap of the objectives from Section 1.1 is:

2900     1.  unambiguous encoding of most common data formats from Internet
2901         standards

2903     2.  code compactness for encoder or decoder

2905     3.  no schema description needed

2907     4.  reasonably compact serialization

2909     5.  applicability to constrained and unconstrained applications

2911     6.  good JSON conversion

2913     7.  extensibility

2915     A discussion of CBOR and other formats with respect to a different
2916     set of design objectives is provided in Section 5 and Appendix C of
2917     [RFC8618].

2919  E.1.  ASN.1 DER, BER, and PER

2921     [ASN.1] has many serializations.  In the IETF, DER and BER are the
2922     most common.  The serialized output is not particularly compact for
2923     many items, and the code needed to decode numeric items can be
2924     complex on a constrained device.

2926     Few (if any) IETF protocols have adopted one of the several variants
2927     of Packed Encoding Rules (PER).  There could be many reasons for
2928     this, but one that is commonly stated is that PER makes use of the
2929     schema even for parsing the surface structure of the data stream,
2930     requiring significant tool support.  There are different versions of
2931     the ASN.1 schema language in use, which has also hampered adoption.

2933  E.2.  MessagePack

2935     [MessagePack] is a concise, widely implemented counted binary
2936     serialization format, similar in many properties to CBOR, although
2937     somewhat less regular.  While the data model can be used to represent
2938     JSON data, MessagePack has also been used in many remote procedure
2939     call (RPC) applications and for long-term storage of data.

2941     MessagePack has been essentially stable since it was first published
2942     around 2011; it has not yet had a transition.  The evolution of
2943     MessagePack is impeded by an imperative to maintain complete
2944     backwards compatibility with existing stored data, while only a few
2945     bytecodes are still available for extension.  Repeated requests over
2946     the years from the MessagePack user community to separate out binary
2947     and text strings in the encoding have recently led to an extension
2948     proposal that would leave MessagePack's "raw" data ambiguous between
2949     its usages for binary and text data.  The extension mechanism for
2950     MessagePack remains unclear.

2952  E.3.  BSON

2954     [BSON] is a data format that was developed for the storage of JSON-
2955     like maps (JSON objects) in the MongoDB database.  Its major
2956     distinguishing feature is the capability for in-place update, which
2957     prevents a compact representation.  BSON uses a counted
2958     representation except for map keys, which are null-byte terminated.
2959     While BSON can be used for the representation of JSON-like objects on
2960     the wire, its specification is dominated by the requirements of the
2961     database application and has become somewhat baroque.  The status of
2962     how BSON extensions will be implemented remains unclear.

2964  E.4.  MSDTP: RFC 713

2966     Message Services Data Transmission (MSDTP) is a very early example of
2967     a compact message format; it is described in [RFC0713], written in
2968     1976.  It is included here for its historical value, not because it
2969     was ever widely used.

2971  E.5.  Conciseness on the Wire

2973     While CBOR's design objective of code compactness for encoders and
2974     decoders is a higher priority than its objective of conciseness on
2975     the wire, many people focus on the wire size.  Table 7 shows some
2976     encoding examples for the simple nested array [1, [2, 3]]; where some
2977     form of indefinite-length encoding is supported by the encoding,
2978     [_ 1, [2, 3]] (indefinite length on the outer array) is also shown.
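The CBOR encodings of this example can be reproduced with a short Python sketch.  It handles only integers and lists, which is all the example needs.  The function names head() and encode() and the indefinite parameter are hypothetical names chosen for illustration; they are not the API of any CBOR library.

```python
def head(mt, n):
    """Encode major type mt with argument n, using the shortest form."""
    if n < 24:
        return bytes([mt << 5 | n])
    for ai, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([mt << 5 | ai]) + n.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode(item, indefinite=False):
    """Encode an int or a (nested) list of ints as CBOR.

    If indefinite is true, only the outermost array uses
    indefinite-length encoding (0x9f ... 0xff).
    """
    if isinstance(item, int) and not isinstance(item, bool):
        if item >= 0:
            return head(0, item)          # major type 0: unsigned
        return head(1, -1 - item)         # major type 1: encodes -1-n
    if isinstance(item, list):
        body = b"".join(encode(x) for x in item)
        if indefinite:
            return b"\x9f" + body + b"\xff"
        return head(4, len(item)) + body  # major type 4: array
    raise TypeError("this sketch only handles ints and lists")
```

Here encode([1, [2, 3]]).hex() gives 8201820203, and encode([1, [2, 3]], indefinite=True).hex() gives 9f01820203ff.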
2980     +-------------+----------------------------+----------------+
2981     | Format      | [1, [2, 3]]                | [_ 1, [2, 3]]  |
2982     +=============+============================+================+
2983     | RFC 713     | c2 05 81 c2 02 82 83       |                |
2984     +-------------+----------------------------+----------------+
2985     | ASN.1 BER   | 30 0b 02 01 01 30 06 02 01 | 30 80 02 01 01 |
2986     |             | 02 02 01 03                | 30 06 02 01 02 |
2987     |             |                            | 02 01 03 00 00 |
2988     +-------------+----------------------------+----------------+
2989     | MessagePack | 92 01 92 02 03             |                |
2990     +-------------+----------------------------+----------------+
2991     | BSON        | 22 00 00 00 10 30 00 01 00 |                |
2992     |             | 00 00 04 31 00 13 00 00 00 |                |
2993     |             | 10 30 00 02 00 00 00 10 31 |                |
2994     |             | 00 03 00 00 00 00 00       |                |
2995     +-------------+----------------------------+----------------+
2996     | CBOR        | 82 01 82 02 03             | 9f 01 82 02 03 |
2997     |             |                            | ff             |
2998     +-------------+----------------------------+----------------+

3000          Table 7: Examples for Different Levels of Conciseness

3002  Appendix F.  Changes from RFC 7049

3004     The following is a list of known changes from RFC 7049.  This list is
3005     non-authoritative.  It is meant to help reviewers see the significant
3006     differences.

3008     *  Updated reference for [RFC4627] to [RFC8259] in many places

3010     *  Updated reference for [CNN-TERMS] to [RFC7228]

3012     *  Added a comment to the last example in Section 2.2.1 (added
3013        "Second value")

3015     *  Fixed a bug in the example in Section 2.4.2 ("29" -> "49")

3017     *  Fixed a bug in the last paragraph of Section 3.6 ("0b000_11101" ->
3018        "0b000_11001")

3020  Appendix G.  Well-formedness errors and examples

3022     There are three basic kinds of well-formedness errors that can occur
3023     in decoding a CBOR data item:

3025     *  Too much data: There are input bytes left that were not consumed.
3026        This is only an error if the application assumed that the input
3027        bytes would span exactly one data item.  Where the application
3028        uses the self-delimiting nature of CBOR encoding to permit
3029        additional data after the data item, as is for example done in
3030        CBOR sequences [I-D.ietf-cbor-sequence], the CBOR decoder can
3031        simply indicate what part of the input has not been consumed.

3033     *  Too little data: The input data available would need additional
3034        bytes added at their end for a complete CBOR data item.  This may
3035        indicate the input is truncated; it is also a common error when
3036        trying to decode random data as CBOR.  For some applications,
3037        however, this may not actually be an error, as the application
3038        may not be certain it has all the data yet and can obtain or wait
3039        for additional input bytes.  Some of these applications may have
3040        an upper limit for how much additional data can show up; here the
3041        decoder may be able to indicate that the encoded CBOR data item
3042        cannot be completed within this limit.

3044     *  Syntax error: The input data are not consistent with the
3045        requirements of the CBOR encoding, and this cannot be remedied by
3046        adding (or removing) data at the end.

3048     In Appendix C, errors of the first kind are addressed in the first
3049     paragraph/bullet list (requiring "no bytes are left"), and errors of
3050     the second kind are addressed in the second paragraph/bullet list
3051     (failing "if n bytes are no longer available").  Errors of the third
3052     kind are identified in the pseudocode by specific instances of
3053     calling fail(), in order:

3055     *  a reserved value is used for additional information (28, 29, 30)

3057     *  major type 7, additional information 24, value < 32 (incorrect or
3058        incorrectly encoded simple type)

3060     *  incorrect substructure of indefinite length byte/text string (may
3061        only contain definite length strings of the same major type)

3063     *  break stop code (mt=7, ai=31) occurs in a value position of a map,
3064        or anywhere except at a position directly in an indefinite length
3065        item where another enclosed data item could also occur

3067     *  additional information 31 used with major type 0, 1, or 6

3069  G.1.  Examples for CBOR data items that are not well-formed

3071     This subsection shows a few examples for CBOR data items that are not
3072     well-formed.  Each example is a sequence of bytes each shown in
3073     hexadecimal; multiple examples in a list are separated by commas.

3075     Examples for well-formedness error kind 1 (too much data) can easily
3076     be formed by adding data to a well-formed encoded CBOR data item.

3078     Similarly, examples for well-formedness error kind 2 (too little
3079     data) can be formed by truncating a well-formed encoded CBOR data
3080     item.  In test suites, it may be beneficial to specifically test with
3081     incomplete data items that would require large amounts of addition to
3082     be completed (for instance by starting the encoding of a string of a
3083     very large size).

3085     A premature end of the input can occur in a head or within the
3086     enclosed data, which may be bare strings or enclosed data items that
3087     are either counted or should have been ended by a break stop code.
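The error kinds described in this appendix can be exercised with a Python transcription of the Figure 1 pseudocode.  This is a sketch under stated assumptions rather than a reference implementation: the names NotWellFormed, TooLittleData, Checker, check(), and is_well_formed() are hypothetical, the sketch uses a separate INDEFINITE sentinel instead of returning 0 for indefinite length items, and check() assumes the input is meant to span exactly one data item, so that leftover bytes count as an error.

```python
class NotWellFormed(Exception):
    """Raised where the pseudocode calls fail(), or on leftover input."""


class TooLittleData(NotWellFormed):
    """Raised when take(n) cannot supply n bytes."""


BREAK = -1        # result for a break stop code
INDEFINITE = -2   # sentinel of this sketch for indefinite length items


class Checker:
    def __init__(self, data):
        self.data, self.pos = bytes(data), 0

    def take(self, n):
        if n > len(self.data) - self.pos:
            raise TooLittleData("input ends early")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def well_formed(self, breakable=False):
        ib = self.take(1)[0]
        mt, ai = ib >> 5, ib & 0x1f
        val = ai
        if 24 <= ai <= 27:                    # 1, 2, 4, or 8 byte argument
            val = int.from_bytes(self.take(1 << (ai - 24)), "big")
        elif ai in (28, 29, 30):
            raise NotWellFormed("reserved additional information")
        elif ai == 31:
            return self.well_formed_indefinite(mt, breakable)
        if mt in (2, 3):                      # byte or text string content
            self.take(val)
        elif mt in (4, 5):                    # array items or map pairs
            for _ in range(val * (mt - 3)):
                self.well_formed()
        elif mt == 6:                         # one embedded data item
            self.well_formed()
        elif mt == 7 and ai == 24 and val < 32:
            raise NotWellFormed("bad simple value")
        return mt

    def well_formed_indefinite(self, mt, breakable):
        if mt in (2, 3):
            while True:
                it = self.well_formed(True)
                if it == BREAK:
                    break
                if it != mt:                  # need definite chunk, same type
                    raise NotWellFormed("bad chunk in indefinite string")
        elif mt == 4:
            while self.well_formed(True) != BREAK:
                pass
        elif mt == 5:
            while self.well_formed(True) != BREAK:
                self.well_formed()            # value: break not allowed here
        elif mt == 7:
            if breakable:
                return BREAK
            raise NotWellFormed("break outside indefinite length item")
        else:
            raise NotWellFormed("additional information 31 with mt 0, 1, 6")
        return INDEFINITE


def check(data):
    """Raise NotWellFormed unless data is exactly one well-formed item."""
    c = Checker(data)
    c.well_formed()
    if c.pos != len(c.data):
        raise NotWellFormed("too much data")


def is_well_formed(hex_string):
    try:
        check(bytes.fromhex(hex_string))
        return True
    except NotWellFormed:
        return False
```

For example, is_well_formed("82 01 82 02 03") is true, while is_well_formed("9f 01 02") is false because the break stop code is missing.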
3089     *  End of input in a head: 18, 19, 1a, 1b, 19 01, 1a 01 02, 1b 01 02
3090        03 04 05 06 07, 38, 58, 78, 98, 9a 01 ff 00, b8, d8, f8, f9 00, fa
3091        00 00, fb 00 00 00

3093     *  Definite length strings with short data: 41, 61, 5a ff ff ff ff
3094        00, 5b ff ff ff ff ff ff ff ff 01 02 03, 7a ff ff ff ff 00, 7b 7f
3095        ff ff ff ff ff ff ff 01 02 03

3097     *  Definite length maps and arrays not closed with enough items: 81,
3098        81 81 81 81 81 81 81 81 81, 82 00, a1, a2 01 02, a1 00, a2 00 00
3099        00

3101     *  Indefinite length strings not closed by a break stop code: 5f 41
3102        00, 7f 61 00

3104     *  Indefinite length maps and arrays not closed by a break stop code:
3105        9f, 9f 01 02, bf, bf 01 02 01 02, 81 9f, 9f 80 00, 9f 9f 9f 9f 9f
3106        ff ff ff ff, 9f 81 9f 81 9f 9f ff ff ff

3108     A few examples for the five subkinds of well-formedness error kind 3
3109     (syntax error) are shown below.

3111     Subkind 1:

3113     *  Reserved additional information values: 1c, 1d, 1e, 3c, 3d, 3e,
3114        5c, 5d, 5e, 7c, 7d, 7e, 9c, 9d, 9e, bc, bd, be, dc, dd, de, fc,
3115        fd, fe

3117     Subkind 2:

3119     *  Reserved two-byte encodings of simple types: f8 00, f8 01, f8 18,
3120        f8 1f

3122     Subkind 3:

3124     *  Indefinite length string chunks not of the correct type: 5f 00 ff,
3125        5f 21 ff, 5f 61 00 ff, 5f 80 ff, 5f a0 ff, 5f c0 00 ff, 5f e0 ff,
3126        7f 41 00 ff

3128     *  Indefinite length string chunks not definite length: 5f 5f 41 00
3129        ff ff, 7f 7f 61 00 ff ff

3131     Subkind 4:

3133     *  Break occurring on its own outside of an indefinite length item:
3134        ff

3136     *  Break occurring in a definite length array or map or a tag: 81 ff,
3137        82 00 ff, a1 ff, a1 ff 00, a1 00 ff, a2 00 00 ff, 9f 81 ff, 9f 82
3138        9f 81 9f 9f ff ff ff ff

3140     *  Break in indefinite length map would lead to odd number of items
3141        (break in a value position): bf 00 ff, bf 00 00 00 ff

3143     Subkind 5:

3145     *  Major type 0, 1, 6 with additional information 31: 1f, 3f, df

3147  Acknowledgements

3149     CBOR was inspired by MessagePack.  MessagePack was developed and
3150     promoted by Sadayuki Furuhashi ("frsyuki").  This reference to
3151     MessagePack is solely for attribution; CBOR is not intended as a
3152     version of or replacement for MessagePack, as it has different design
3153     goals and requirements.

3155     The need for functionality beyond the original MessagePack
3156     Specification became obvious to many people at about the same time
3157     around the year 2012.  BinaryPack is a minor derivation of
3158     MessagePack that was developed by Eric Zhang for the binaryjs
3159     project.  A similar, but different, extension was made by Tim Caswell
3160     for his msgpack-js and msgpack-js-browser projects.  Many people have
3161     contributed to the discussion about extending MessagePack to separate
3162     text string representation from byte string representation.

3164     The encoding of the additional information in CBOR was inspired by
3165     the encoding of length information designed by Klaus Hartke for CoAP.

3167     This document also incorporates suggestions made by many people,
3168     notably Dan Frost, James Manger, Jeffrey Yasskin, Joe Hildebrand,
3169     Keith Moore, Laurence Lundblade, Matthew Lepinski, Michael
3170     Richardson, Nico Williams, Peter Occil, Phillip Hallam-Baker, Ray
3171     Polk, Tim Bray, Tony Finch, Tony Hansen, and Yaron Sheffer.

3173  Authors' Addresses

3175     Carsten Bormann
3176     Universitaet Bremen TZI
3177     Postfach 330440
3178     D-28359 Bremen
3179     Germany

3181     Phone: +49-421-218-63921
3182     Email: cabo@tzi.org

3184     Paul Hoffman
3185     ICANN

3187     Email: paul.hoffman@icann.org