idnits 2.17.1 draft-thierry-bulk-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 900 has weird spacing: '...s value the c...' == Line 928 has weird spacing: '...s value the e...' == Line 941 has weird spacing: '...s value the s...' == Line 1282 has weird spacing: '...pe name appli...' == Line 1284 has weird spacing: '...pe name bulk...' == (9 more instances...) -- The document date () is 739370 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '1' on line 1360 -- Obsolete informational reference (is this intentional?): RFC 7540 (ref. 'HTTP2') (Obsoleted by RFC 9113) Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group P. 
Thierry 3 Internet-Draft Thierry Technologies 4 Intended status: Experimental may 8, 2018 5 Expires: November 9, 2018 7 Binary Uniform Language Kit 1.0 8 draft-thierry-bulk-03 10 Abstract 12 This specification describes a uniform, decentrally extensible and 13 efficient format for data serialization. 15 Status of This Memo 17 This Internet-Draft is submitted in full conformance with the 18 provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF). Note that other groups may also distribute 22 working documents as Internet-Drafts. The list of current Internet- 23 Drafts is at https://datatracker.ietf.org/drafts/current/. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 This Internet-Draft will expire on November 9, 2018. 32 Copyright Notice 34 Copyright (c) 2018 IETF Trust and the persons identified as the 35 document authors. All rights reserved. 37 This document is subject to BCP 78 and the IETF Trust's Legal 38 Provisions Relating to IETF Documents 39 (https://trustee.ietf.org/license-info) in effect on the date of 40 publication of this document. Please review these documents 41 carefully, as they describe your rights and restrictions with respect 42 to this document. Code Components extracted from this document must 43 include Simplified BSD License text as described in Section 4.e of 44 the Trust Legal Provisions and are provided without warranty as 45 described in the Simplified BSD License. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 50 1.1. Rationale . . . . . . . . . . . . . . . . . . . . . . . . 3 51 1.1.1. Definitions . . . . . . . . . . . . . . . . . . . . . 3 52 1.1.2. State of the art . . . . . . . . . . . . . . . . . . 
4 53 1.2. Format overview . . . . . . . . . . . . . . . . . . . . . 5 54 1.3. Conventions and Terminology . . . . . . . . . . . . . . . 6 55 2. BULK syntax . . . . . . . . . . . . . . . . . . . . . . . . . 8 56 2.1. Parsing algorithm . . . . . . . . . . . . . . . . . . . . 8 57 2.1.1. Evaluation . . . . . . . . . . . . . . . . . . . . . 9 58 2.2. Forms . . . . . . . . . . . . . . . . . . . . . . . . . . 10 59 2.2.1. starting marker byte . . . . . . . . . . . . . . . . 10 60 2.2.2. ending marker byte . . . . . . . . . . . . . . . . . 10 61 2.2.3. Difference between sequence and form . . . . . . . . 10 62 2.3. Atoms . . . . . . . . . . . . . . . . . . . . . . . . . . 10 63 2.3.1. nil . . . . . . . . . . . . . . . . . . . . . . . . . 10 64 2.3.2. Array . . . . . . . . . . . . . . . . . . . . . . . . 11 65 2.3.3. Binary words . . . . . . . . . . . . . . . . . . . . 11 66 2.3.4. Reserved marker bytes . . . . . . . . . . . . . . . . 14 67 2.3.5. Reference . . . . . . . . . . . . . . . . . . . . . . 14 68 3. Standard namespaces . . . . . . . . . . . . . . . . . . . . . 15 69 3.1. BULK core namespace . . . . . . . . . . . . . . . . . . . 15 70 3.1.1. Version . . . . . . . . . . . . . . . . . . . . . . . 15 71 3.1.2. true . . . . . . . . . . . . . . . . . . . . . . . . 16 72 3.1.3. false . . . . . . . . . . . . . . . . . . . . . . . . 16 73 3.1.4. Strings encoding . . . . . . . . . . . . . . . . . . 16 74 3.1.5. IANA registered character set . . . . . . . . . . . . 16 75 3.1.6. Windows code page . . . . . . . . . . . . . . . . . . 17 76 3.1.7. Namespaces . . . . . . . . . . . . . . . . . . . . . 17 77 3.1.8. Definitions . . . . . . . . . . . . . . . . . . . . . 18 78 3.1.9. Arithmetic . . . . . . . . . . . . . . . . . . . . . 21 79 3.1.10. Compact formats . . . . . . . . . . . . . . . . . . . 22 80 4. Extension namespaces . . . . . . . . . . . . . . . . . . . . 26 81 5. Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . 26 82 5.1. Profile redundancy . . . . . . . 
. . . . . . . . . . . . 26 83 5.2. Standard profile . . . . . . . . . . . . . . . . . . . . 26 84 6. Security Considerations . . . . . . . . . . . . . . . . . . . 27 85 6.1. Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 27 86 6.2. Forwarding . . . . . . . . . . . . . . . . . . . . . . . 27 87 6.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 27 88 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 27 89 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 28 90 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 28 91 9.1. Normative References . . . . . . . . . . . . . . . . . . 29 92 9.2. Informative references . . . . . . . . . . . . . . . . . 29 93 9.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 29 94 Appendix A. Robust namespace definition . . . . . . . . . . . . 30 95 A.1. Selective authority . . . . . . . . . . . . . . . . . . . 30 96 A.2. Open authority . . . . . . . . . . . . . . . . . . . . . 30 97 Appendix B. Verifiable namespace bootstrap . . . . . . . . . . . 31 98 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 31 100 1. Introduction 102 1.1. Rationale 104 This specification aims at finding an original trade-off between 105 uniformity, generality, extensibility, decentralization, compactness 106 and processing speed for a data format. It is our opinion that every 107 widely used existing format occupies a different position than this one 108 in the solution space for formats, hence this new design. It is also 109 our opinion that most of those existing formats constitute an optimal 110 solution for their specific use case, either in an absolute sense, or 111 at least at the time of their design. But the ever-changing field of 112 IT now faces new challenges that call for a new approach. 
114 In particular, whereas the previous trend for Internet and Web 115 standards and programming tools has been to create human-readable 116 syntaxes for data and protocols, the advent of technologies like 117 protocol buffers [protobuf], Thrift [Thrift], the various binary 118 serializations for JSON like Avro [Avro] or Smile [Smile], or the 119 binary HTTP/2 [HTTP2] seems to indicate that the time is ripe for a 120 generalized use of binary, reserved until now for the low-level 121 protocols and arbitrary data storage. The lessons about flexibility 122 learnt in the previous switch from binary to plain text can now be 123 applied to efficient binary syntaxes. 125 1.1.1. Definitions 127 By *uniformity*, we mean the property of a syntax that can be parsed 128 even by an application that doesn't understand the semantics of every 129 part of the processed data. Of course, almost all syntaxes that 130 feature uniformity contain a limited number of non-uniform elements. 131 Also, uniformity really only has value in the face of extension, as a 132 fixed syntax doesn't need uniformity (it only makes the 133 implementation simpler). 135 Almost all extensible syntaxes have their extensible part uniform to 136 a great degree. For the purpose of this specification, uniformity 137 has hence been evaluated on two criteria: first, the number of 138 non-uniform elements (and, incidentally, their diversity), second, the 139 fact that the uniformity of the extensible part is not a limitation 140 to the users (i.e. that the temptation to extend the language in a 141 non-uniform way is as absent as possible). 143 A good counter-example is found in most programming languages. 144 Adding a new branching construct cannot be done in a terse way 145 without modifying the underlying implementation. Such a construct 146 either cannot be defined by user code (because of evaluation rules) 147 or can only be defined in a terribly verbose and inconvenient way (with lots of 148 boilerplate code). 
Notable exceptions to this limitation of 149 programming languages are Lisp, Haskell and stack programming 150 languages. 152 On the other hand, a stack programming language is the canonical 153 example of a non-uniform language. Each operator takes a number of 154 operands from the stack. Not knowing the arity of an operator makes 155 it impossible to continue parsing, even when its evaluation was 156 optional to the final processing. In the design space, stack 157 programming languages completely sacrifice uniformity to achieve one 158 of the highest combinations of extensibility, compactness and speed of 159 processing. 161 By *generality*, we mean the ability of a syntax to lend itself to 162 describing any kind of data with a reasonable (or better yet, high) 163 level of compactness and simplicity. For example, although both 164 arrays and linked lists could be considered very general as they are 165 both able to store any kind of data, they actually are so at the 166 respective cost of complexity (arrays need the embedding of data 167 structure in the data or in the processing logic) and size (in-memory 168 linked lists can waste as much as half or two thirds of the space for 169 the overhead of the data structure). 171 By *decentralization*, we mean the ability to extend the syntax in a 172 way that avoids naming collisions without the use of a central 173 registry. Note that the DNS, as we use it, is NOT decentralized in 174 this sense, but distributed, as it cannot work without its root 175 servers and not even without prior knowledge of their location. 177 1.1.2. State of the art 179 Uniformity, generality and extensibility are usually highly-valued 180 traits in format design. Programming languages obviously feature 181 them foremost, although their generality usually stops at what they 182 are supposed to express: procedures. 
Most of them are ill-suited to 183 represent arbitrary data, but notable exceptions include Lisp (where 184 "code is data") and JavaScript, from which a subset has been 185 extracted to exchange data, JSON, which has seen tremendous success 186 for this purpose. JSON may lack in generality and compactness, but 187 its design makes its parsing really straightforward and fast. All of 188 them, though, lack decentralization. Some of them make it possible 189 to extend them in a distributed way if some discipline is followed 190 (for example, by naming modules after domain names), but the 191 discipline is not mandatory (and even with domain names, a change of 192 ownership makes name collisions possible). 194 The SGML/XML family of formats also features uniformity, generality 195 and extensibility and actually fares much better than programming 196 languages on the three fronts. XML namespaces also make XML naming 197 distributed and there have been attempts at making it compact (e.g. 198 EXI from W3C, Fast Infoset from ISO/ITU or EBML). 200 All the previously cited formats clearly lack compactness, although 201 just applying standard compression techniques would sacrifice only 202 very little processing time to gain huge size reductions on most of 203 their intended use cases. 205 So-called binary formats pretty much exhibit the opposite trade-offs. 206 Most of them give up uniformity to achieve better compactness. Some are 207 specifically designed for great generality, but many lack 208 extensibility. When they are extensible, it's never in a 209 decentralized way, again for reasons that have to do with 210 compactness. They are usually extremely fast to parse. 212 Actually, many binary formats are not so much formats as format 213 frameworks, and exclude extensibility by design. 
For each use case, 214 an IDL compiler creates a brand new format that is essentially 215 incompatible with all other formats created by the same compiler 216 (EBML specifically cites this property among its own disadvantages). 217 If the IDL compiler and framework are correctly designed, such a 218 format usually represents an optimum in compactness and speed of 219 processing, as the compiler can also automatically generate an ad-hoc 220 optimized parser. 222 1.2. Format overview 224 A BULK stream is a stream of 8-bit bytes, in big-endian order. 225 Parsing a BULK stream yields a sequence of expressions, each of which is 226 either an atom or a form, a form being itself a sequence of expressions. The 227 syntax of forms is entirely uniform, without a single exception: a 228 starting byte marker, a sequence of expressions and an ending byte 229 marker. Among atoms, only nil (the null byte), arrays and fixed- 230 sized binary words have a special syntax, for efficiency purposes. 231 Even booleans and floating-point numbers follow the uniform syntax 232 that every other expression follows. 234 Non-uniform atoms start with a marker byte, followed by a static or 235 dynamic number of bytes, depending on the type. 237 Any other atom is a reference, which consists of a namespace marker 238 (in almost all cases, a single byte) followed by an identifier within 239 this namespace (a single byte). All in all, very little sacrifice 240 is made in compactness for the benefit of a very simple syntax: apart 241 from nil, nothing is smaller than 2 bytes, and as most forms involve 242 a reference followed by some content, a form is usually 4 bytes + its 243 content. 245 A namespace marker in a BULK stream is associated with a namespace 246 identified by some identifier guaranteed to be unique without 247 coordination (like a UUID or cryptographic hash), thus ensuring 248 decentralized extensibility. The stream can be processed even if the 249 application doesn't recognize the namespace. 
Parsing remains 250 possible thanks to the uniform syntax. 252 Combination of BULK namespaces, BULK streams and even other formats 253 doesn't need any content transformation to work. Here are some 254 examples: 256 o The content of a BULK stream, enclosed in sequence starting and 257 ending byte markers, constitutes a valid BULK expression. Thus 258 BULK streams can be packed or annotated within a BULK stream 259 without modification. Annotation use cases include adding 260 metadata or a cryptographic signature. 262 o A BULK format could specify in its syntax the place for an 263 expression holding metadata. Whether the specification provides 264 its own metadata forms or not, an application could use a BULK 265 serialization for MARC, TEI Header, XML or RDF for this metadata 266 expression. The vocabulary selected would be univocally expressed 267 by the namespace and every vocabulary would be parsed by the same 268 mechanisms. 270 o Whenever content must be stored as-is instead of serialized or a 271 highly-optimized ad hoc serialization exists for some data, 272 anything can always be stored within an array. Arrays can contain 273 arbitrary bytes and there is no limit to their size. 275 Furthermore, BULK expressions can be evaluated. Most expressions 276 evaluate to themselves, but some evaluate by default to the result of 277 a function call, making it possible to serialize data in an even more 278 compact form, by eliminating boilerplate data and repeated patterns. 280 1.3. Conventions and Terminology 282 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 283 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 284 document are to be interpreted as described in RFC 2119 [RFC2119]. 286 Literal numerical values are provided in decimal or hexadecimal as 287 appropriate. Hexadecimal literals are prefixed with "0x" to 288 distinguish them from decimal literals. 
290 The text notation of the BULK stream uses mnemonics for some bytes 291 sequences. Mnemonics are series of characters, excluding all capital 292 letters and white space, like "this-is-one-mnemonic" or "what- 293 the-%S.!?#-is-that?". They are always separated by white space. 294 Outside the use of mnemonics, a sequence of bytes (of one or more 295 bytes) can be represented by its hexadecimal value as an unsigned 296 integer (e.g. "0x3F" or "0x3A0B770F"). Some types in this 297 specification define a special syntax for their representation in the 298 text notation. 300 In the grammar, a shape is a pattern of bytes, following the rules of 301 the text notation for a BULK stream. Apart from mnemonics and fixed 302 sequences of bytes, a shape can contain: 304 o an arbitrary sequence of a fixed number of bytes, represented by 305 its size, i.e. a number of bytes in decimal immediately followed 306 by a B uppercase letter (e.g. "4B") 308 o a typed sequence of bytes, represented by the name of its type, a 309 capitalized word (e.g. "Foo"); this means a sequence of bytes 310 whose specific yield (cf. Section 2.1) has this type 312 o a named sequence of bytes (of zero or more bytes), represented by 313 a series of any character excluding '{}' between '{' and '}' (e.g. 314 "{quux}"); a named sequence can be typed or sized, in which case 315 it is immediately followed by ':' and a type or size (e.g. 316 "{quux}:Bar" or "{quux}:12B") 318 When an entire shape describes the byte sequence of an atom, it is 319 the normative specification for parsing it, but shapes of forms are 320 only normative with respect to their default evaluation and the 321 corresponding semantics. A reference defined with a form shape can 322 be used in different shapes, albeit with different semantics and 323 value and even when used in its default shape, a processing 324 application MAY give it alternative semantics (although this is not 325 recommended). 
327 For example, this specification defines a way to specify a string 328 encoding with forms of the shape "( stringenc {enc}:Expr )". But the 329 shapes "( stringenc {arg1}:Int {arg2}:Int )" and "( {arg1}:Int 330 stringenc {arg2}:Int )" are also syntactically valid. They just have 331 unspecified semantics, as far as this specification is concerned. 333 2. BULK syntax 335 A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in 336 big-endian order. The result of parsing a BULK stream is a sequence 337 of abstract data, called the abstract yield. BULK parsing is 338 deterministic but not injective: a BULK stream has only one abstract yield, but different 339 BULK streams can have the same abstract yield. 341 A processing application is not expected to actually produce the 342 abstract yield, but an adaptation of the abstract yield to its own 343 implementation, called the concrete yield. Also, some expressions in 344 a BULK stream may have the semantics of a transformation of the 345 abstract yield. A processing application MAY thus not produce or 346 retain the concrete yield but the result of its transformation. This 347 specification deals mainly with the byte sequence and the abstract 348 yield and occasionally provides guidelines about the concrete yield. 349 Of course, a processing application MAY not produce the concrete 350 yield at all but produce various side effects from parsing the BULK 351 stream. 353 The abstract yield is a sequence of expressions. Expressions can be 354 atoms or forms. Forms are sequences of expressions. If a byte 355 sequence is parsed as one or several expressions, this byte sequence 356 is said to denote these expressions. 358 When a sequence of bytes is named in a shape, its name can be used in 359 this specification to designate either the byte sequence, or the 360 expression or sequence of expressions it denotes. When there could 361 be ambiguity, this specification specifies which is designated. 363 2.1. 
Parsing algorithm 365 The parser operates with a context, which is a sequence of 366 expressions. Each time an expression is parsed, it is appended at 367 the end of the context. The initial context is the abstract yield. 369 At the beginning of a BULK stream and after having consumed the byte 370 sequence denoting a complete expression, the parser is at the 371 dispatch stage. At this stage, the next byte is a marker byte, which 372 tells the parser what kind of expression comes next (the marker byte 373 is the first byte of the sequence that denotes an expression). The 374 expression appended to the context after reading a byte sequence is 375 called the specific yield of the byte sequence. 377 The "0x1" and "0x2" marker bytes are special cases. When the parser 378 reads "0x1", it immediately appends an empty sequence to the current 379 context. This sequence becomes the new context. This new context 380 has the previous context as parent. Then the parser returns to its 381 dispatch stage. When the parser reads "0x2", it appends nothing to 382 the context, but instead the parent of the current context becomes 383 the new context and the parser returns to the dispatch stage. Thus 384 it is a parsing error to read "0x2" when the context is the abstract 385 yield. 387 The scope of an expression is the part of its context that follows 388 the expression. 390 This specification designates the context where the expressions 391 contained in a form are appended as the inner scope of the form. Its 392 parent context is designated as the outer scope of the form. 394 Whenever a parsing error is encountered, parsing of the BULK stream 395 MUST stop. 397 2.1.1. Evaluation 399 A processing application MAY implement evaluation of BULK expressions 400 and streams. When evaluating a BULK stream, when the parser gets to 401 the dispatch stage and the context is the abstract yield, the last 402 expression in the context is replaced by what it evaluates to. 
(of 403 course, this description is supposed to provide the semantics of BULK 404 evaluation, but a processing application MAY implement evaluation 405 with a different algorithm as long as it provides the same semantics) 407 The default evaluation rule is that an expression evaluates to 408 itself. A name within a namespace can have a value, which is what a 409 reference associated to this name evaluates to. A reference whose 410 marker value is associated to no namespace or whose name has no value 411 evaluates to itself. How self-evaluating BULK expressions are 412 represented in the concrete yield is application-dependent, but 413 future specifications MAY define a standard API to access it, similar 414 to the Document Object Model for XML. 416 The evaluation of a sequence obeys a special rule, though: if the 417 first expression of the sequence has type "Function", that function 418 is called with an argument list and the sequence evaluates to the 419 return value. 421 If the function has type "LazyFunction", the argument list is the 422 rest of the sequence. If the function has type "EagerFunction", the 423 argument list is the rest of the sequence, where each expression is 424 replaced by what it evaluates to. Any expression that has type 425 "LazyFunction" or "EagerFunction" also has type "Function". 427 If the result of the evaluation of a "Function" is a sequence, it is 428 evaluated in turn. 430 2.2. Forms 432 2.2.1. starting marker byte 434 marker "0x1" 436 mnemonic "(" 438 2.2.2. ending marker byte 440 marker "0x2" 442 mnemonic ")" 444 2.2.3. Difference between sequence and form 446 There is a difference between a byte sequence denoting a sequence of 447 expressions among the current context and a byte sequence denoting a 448 form (i.e. a single expression that contains a sequence of 449 expressions). As an example, let's examine several forms of the 450 shape "( foo {seq} )". 
452 o In the form "( foo nil nil nil )", {seq} denotes 3 expressions, 453 and they are three atoms in the yield. 455 o In the form "( foo nil )", {seq} is a single expression in the 456 yield, and that expression is an atom. 458 o In the form "( foo ( nil nil nil ) )", {seq} is also a single 459 expression in the yield, and that expression is a form, a sequence 460 in the yield. 462 In a shape, when a byte sequence must yield a single expression, it 463 has the type "Expr". So the last two examples fit the shape "( foo 464 {seq}:Expr )" but not the first. When a byte sequence must yield a 465 form, it has type "Form". Thus the shape "( foo {bar}:Form )" is 466 equivalent to "( foo ( {baz} ) )". Either one MAY be used. 468 2.3. Atoms 470 2.3.1. nil 472 marker "0x0" (mnemonic: "nil") 474 shape "nil" 476 Apart from being a possible short marker value, the fact that the 477 "0x0" byte represents a valid atom means that a sequence of null 478 bytes is a valid part of a BULK stream, thus making the format less 479 fragile. In a network communication, nil atoms can be sent to keep 480 the channel open. They can also be used as padding at the end of a 481 form or between forms. 483 2.3.2. Array 485 marker "0x3" (mnemonic: "#") 487 shape "# Int {content}" 489 Arrays have a special parsing rule. After consuming the marker byte, 490 the parser returns to the dispatch stage. It is a parser error if 491 the parsed expression is not of type Int or if its value cannot be 492 recognized. This integer is not added to any context, but the parser 493 consumes as many bytes as this integer and they constitute the 494 content of this array. 496 If two arrays have the shapes "# {s1} {c1}" and "# {s2} {c2}" and if 497 "{s1+s2}" denotes the sum of the integers "{s1}" and "{s2}", then 498 their concatenation is "# {s1+s2} {c1} {c2}". 500 In the text notation, a quoted string represents an array containing 501 the encoding of that string in the current encoding. 
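The dispatch stage of Section 2.1, together with the special parsing rule for arrays above, can be sketched as follows. This is a minimal, non-normative illustration in Python and not part of the specification: the names "parse", "Ref" and "ParseError" are hypothetical, the concrete yield chosen here (None for nil, "bytes" for arrays, "int" for words, lists for forms) is only one possible adaptation, and two simplifying assumptions are made: an array size is only accepted as an unsigned word, and a stream ending inside an unterminated form is treated as a parsing error. The word and reference markers used are those defined in Sections 2.3.3 and 2.3.5.

```python
from dataclasses import dataclass

# Payload sizes in bytes, keyed by marker byte (Sections 2.3.3 and 2.3.3.6).
UNSIGNED = {0x4: 1, 0x5: 2, 0x6: 4, 0x7: 8, 0x8: 16}
NEGATIVE = {0x9: 1, 0xA: 2, 0xB: 4, 0xC: 8, 0xD: 16}


class ParseError(Exception):
    """Hypothetical error type; parsing stops as soon as it is raised."""


@dataclass(frozen=True)
class Ref:
    """Hypothetical concrete yield of a reference atom."""
    ns: int    # value associated with a namespace
    name: int  # name within that namespace


def parse(data: bytes) -> list:
    pos = 0

    def take(count: int) -> bytes:
        nonlocal pos
        if pos + count > len(data):
            raise ParseError("truncated stream")
        chunk = data[pos:pos + count]
        pos += count
        return chunk

    top: list = []  # the initial context is the abstract yield
    stack = [top]   # stack[-1] is the current context
    while pos < len(data):
        marker = take(1)[0]  # dispatch stage
        if marker == 0x1:    # "(": open a new context inside the current one
            form: list = []
            stack[-1].append(form)
            stack.append(form)
        elif marker == 0x2:  # ")": return to the parent context
            if len(stack) == 1:
                raise ParseError("0x2 read while the context is the yield")
            stack.pop()
        elif marker == 0x0:  # nil
            stack[-1].append(None)
        elif marker == 0x3:  # array: an Int size, then that many raw bytes
            size_marker = take(1)[0]
            if size_marker not in UNSIGNED:  # assumption: unsigned words only
                raise ParseError("array size is not a recognized Int")
            size = int.from_bytes(take(UNSIGNED[size_marker]), "big")
            stack[-1].append(take(size))  # the size itself is not yielded
        elif marker in UNSIGNED:
            stack[-1].append(int.from_bytes(take(UNSIGNED[marker]), "big"))
        elif marker in NEGATIVE:
            stack[-1].append(-int.from_bytes(take(NEGATIVE[marker]), "big"))
        elif marker < 0x20:  # 0xE-0x1F are reserved in major version 1
            raise ParseError("reserved marker byte")
        else:                # reference; 0xFF extends the namespace value
            ns = marker
            while marker == 0xFF:
                marker = take(1)[0]
                ns += marker
            stack[-1].append(Ref(ns, take(1)[0]))
    if len(stack) != 1:  # assumption: an unterminated form is an error
        raise ParseError("stream ended inside a form")
    return top
```

With this sketch, parse(bytes.fromhex("ffff8c1a")) returns [Ref(ns=650, name=26)] and parse(bytes.fromhex("0a01ff")) returns [-511], in line with the examples given in Sections 2.3.5.1 and 2.3.3.6. Evaluation (Section 2.1.1) is deliberately left out.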
503 Types: "Array", "Bytes" 505 In a shape, the type String is synonymous with Array, but means that 506 the content of the array is supposed to be taken as a string. 508 2.3.3. Binary words 510 An unsigned word can be interpreted either as a bit sequence or as an 511 unsigned integer in binary notation. The choice depends on the 512 context and the application. Actually, many processing applications 513 may not need to make any choice, as most programming language 514 implementations actually also confuse unsigned integers and bit 515 sequences to some extent. 517 2.3.3.1. 8 bits word 519 marker "0x4" (mnemonic: "w8") 521 shape "w8 1B" 523 Types: "Int", "Word", "Word8", "Bytes" 525 2.3.3.2. 16 bits word 527 marker "0x5" (mnemonic: "w16") 529 shape "w16 2B" 531 Types: "Int", "Word", "Word16", "Bytes" 533 2.3.3.3. 32 bits word 535 marker "0x6" (mnemonic: "w32") 537 shape "w32 4B" 539 Types: "Int", "Word", "Word32", "Bytes" 541 2.3.3.4. 64 bits word 543 marker "0x7" (mnemonic: "w64") 545 shape "w64 8B" 547 Types: "Int", "Word", "Word64", "Bytes" 549 2.3.3.5. 128 bits word 551 marker "0x8" (mnemonic: "w128") 553 shape "w128 16B" 555 Types: "Int", "Word", "Word128", "Bytes" 557 2.3.3.6. Negative integers 559 Note that BULK doesn't include signed words using two's complement, 560 because BULK's design makes them inherently wasteful. If you were to 561 design an ad hoc binary format that is parsed according to a schema 562 known in advance, like TCP/IP, and you were to include a field that 563 can contain either a positive or negative integer, you would need to 564 use one bit to indicate that integer's sign, in which case you might 565 as well use two's complement, whose properties are well known, which lets 566 you write to and from memory, etc. 568 But in BULK, a word used for a positive integer (otherwise known as 569 an unsigned integer) is already preceded by a marker byte. 
If BULK 570 included signed integers, there would never be any sense in using them 571 for positive integers, so a one-byte signed integer would only be 572 used for integers between -1 and -128. With markers for negative 573 integers, the one-byte word can be used for integers between -1 and 574 -255. 576 Also, BULK is a format for storage and wire transport, not in-memory 577 data, where two's complement is useful because it supports bitwise 578 arithmetic, something that isn't relevant here. 580 The only foreseen use of two's complement signed integers is in large 581 arrays of data, like raster images, sound, video or any other 582 temporal series, e.g. physical measurements. In that use case, the one- 583 byte overhead for each number is obviously unacceptable and they 584 would be stored in an array. A surrounding form or the format's 585 specification would tell how to interpret the contents of that array, 586 in terms of size and signedness. 588 The semantics of each of the following words is the opposite of the 589 contained unsigned integer. For example, "0xA 0x1 0xFF" denotes the 590 number -511. 592 2.3.3.6.1. 8 bits negative word 594 marker "0x9" (mnemonic: "neg8") 596 shape "neg8 1B" 598 Types: "Int", "Word", "Word8", "Bytes" 600 2.3.3.6.2. 16 bits negative word 602 marker "0xA" (mnemonic: "neg16") 604 shape "neg16 2B" 606 Types: "Int", "Word", "Word16", "Bytes" 608 2.3.3.6.3. 32 bits negative word 610 marker "0xB" (mnemonic: "neg32") 612 shape "neg32 4B" 614 Types: "Int", "Word", "Word32", "Bytes" 616 2.3.3.6.4. 64 bits negative word 618 marker "0xC" (mnemonic: "neg64") 620 shape "neg64 8B" 621 Types: "Int", "Word", "Word64", "Bytes" 623 2.3.3.6.5. 128 bits negative word 625 marker "0xD" (mnemonic: "neg128") 627 shape "neg128 16B" 629 Types: "Int", "Word", "Word128", "Bytes" 631 2.3.4. Reserved marker bytes 633 Marker bytes "0xE-0x1F" are reserved for future major versions of 634 BULK. 
It is a parser error if a BULK stream with major version 1 635 contains such a marker byte. 637 2.3.5. Reference 639 marker "0x20-0xFF" 641 shape "{ns}:1B {name}:1B" 643 The "{ns}" byte is a value associated with a namespace. Values 644 "0x20-0x27" are reserved for namespaces defined by BULK 645 specifications. Greater values can be associated with namespaces 646 identified by a unique identifier. 648 The "{name}" byte is the name within the namespace. Vocabularies 649 with more than 256 names thus need to be spread across several 650 namespaces. 652 The specification of a namespace SHOULD include a mnemonic for the 653 namespace and for each defined name. When descriptions use several 654 namespaces, the mnemonic of a reference SHOULD be the concatenation 655 of the namespace mnemonic, ":" and the name mnemonic if there can be 656 an ambiguity. For example, the "fp" name in namespace "math" becomes 657 "math:fp". 659 Type: "Ref" 661 2.3.5.1. Special case 663 References have a special parsing rule, for BULK streams that need 664 a large number of namespaces: if the marker byte is "0xFF", the 665 parser continues to read bytes until it finds a byte different from 666 0xFF. The sum of all of those bytes, from the marker byte through that final byte, taken as unsigned integers is 667 the value associated with a namespace. For example, the reference 668 denoted by the bytes "0xFF 0xFF 0x8C 0x1A" is the name 26 in the 669 namespace associated with 650 (255 + 255 + 140). 671 3. Standard namespaces 673 Standard namespaces have a fixed marker value and are not identified 674 by a unique identifier. 676 3.1. BULK core namespace 678 marker "0x20" (mnemonic: "bulk") 680 3.1.1. Version 682 name "0x0" (mnemonic: "version") 684 shape "( version {major}:Int {minor}:Int )" 686 When parsing a BULK stream, a processing application MUST determine 687 explicitly the major and minor version of the BULK specification 688 that the stream obeys. 
   This information MAY be exchanged out-of-band, if BULK is used to
   exchange a number of very small messages, where repeated headers of
   8 bytes might become too big an overhead. A processing application
   MUST NOT assume a default version.

   If the version is expressed within a BULK stream, this form MUST be
   the first in the stream. In any other place, this form has no
   semantics attached to it. This specification defines BULK 1.0. When
   writing a BULK stream, an application MUST denote {major} and
   {minor} by the smallest byte sequence possible using unsigned words
   from this specification.

   An application writing a BULK stream to long-term storage (e.g. in a
   file or a database record) SHOULD include a "version" form.

   Two BULK versions with the same major version MUST share the same
   parsing rules and the same definitions of marker bytes. Changing the
   syntax or semantics of existing marker bytes, or using marker bytes
   in the reserved interval, warrants a new major version. So does
   changing the syntax or semantics of existing names in standard
   namespaces.

   Adding standard namespaces or adding names in existing standard
   namespaces warrants a new minor version.

3.1.2.  true

   name  "0x1" (mnemonic: "true")

   shape  "true"

   Type: "Boolean".

3.1.3.  false

   name  "0x2" (mnemonic: "false")

   shape  "false"

   Type: "Boolean".

3.1.4.  Strings encoding

   name  "0x3" (mnemonic: "stringenc")

   shape  "( stringenc {enc}:Encoding )"

   This tells the processing application that, in the scope of this
   expression, all expressions that are understood by the application
   as character strings will be encoded with the encoding designated by
   {enc}.

   As the abstract yield doesn't contain strings but expressions that
   will be used as strings by the application, it is not a parsing
   error if the application doesn't recognize {enc}.
   In this situation, it is a parsing error when the application
   actually needs to decode a byte sequence as a string. It is not a
   parsing error when a processing application only transmits a byte
   sequence encoding a string, if it can accurately convey the encoding
   to the receiving application.

3.1.5.  IANA registered character set

   name  "0x4" (mnemonic: "iana-charset")

   shape  "( iana-charset {id}:Int )"

   This designates the string encoding registered among the IANA
   Character Sets [IANA-Charsets] whose MIBenum is {id}.

   Type: "Encoding".

3.1.6.  Windows code page

   name  "0x5" (mnemonic: "code-page")

   shape  "( code-page {id}:Int )"

   This designates the string encoding among Windows code pages whose
   identifier is {id}.

   Type: "Encoding".

3.1.7.  Namespaces

3.1.7.1.  Note about unique identifiers

   Several objects in this specification and future BULK specifications
   are identified by something of type UniqueID. This specification
   purposely doesn't define any UniqueID form, because what constitutes
   a unique enough identifier varies over time and domains and because
   BULK's nature makes specifying them in advance actually unnecessary
   (cf. Appendix B, Verifiable namespace bootstrap).

   Anything, including a bare array containing some identifying byte
   string, could be used as a UniqueID, but we recommend enclosing any
   such data in a form specifying how to interpret it. For example, a
   "crypto" namespace could include a "md6" name, to use forms of shape
   "( crypto:md6 Word128 )" as UniqueID.

3.1.7.2.  New namespace

   name  "0x6" (mnemonic: "ns")

   shape  "( ns {marker}:Int {id}:UniqueID )"

   This associates the namespace identified by {id} to the value
   {marker}, within the scope of this expression.

3.1.7.3.  Package

   name  "0x7" (mnemonic: "package")

   shape  "( package {id}:UniqueID {namespaces} )"

   This creates a package identified by {id}.
   Packages are immutable: {id} MUST be verifiable against the byte
   sequence {namespaces}. {namespaces} must be a sequence of
   expressions of type UniqueID, each identifying a BULK namespace.

3.1.7.4.  Import

   name  "0x8" (mnemonic: "import")

   shape  "( import {base}:Int {count}:Int {id}:UniqueID )"

   This associates the first {count} namespaces in the package
   identified by {id} with a continuous range of values starting at
   {base} within the scope of this expression.

3.1.8.  Definitions

   To define a reference is to change the value of its name in its
   namespace (as identified by its unique identifier, not the marker
   value) within a certain scope.

   If a BULK stream is not evaluated, the semantics of a definition are
   entirely application-dependent.

   When a BULK stream containing definitions for a namespace comes from
   a trusted source (e.g. in configuration files of the application, or
   in the communication with an agent that has been granted the
   relevant authority), an application MAY give those definitions
   long-lasting semantics (i.e. keep the values of the names at the end
   of parsing). This is the preferred mechanism for BULK namespace
   definition when the semantics of the defined expressions can be
   expressed completely by BULK forms.

3.1.8.1.  Simple definition

   name  "0x9" (mnemonic: "define")

   shape  "( define {ref}:Ref {value}:Expr )"

   This defines the reference {ref} to the yield of {value} in the
   outer scope of this form.

3.1.8.2.  Named definition

   name  "0xA" (mnemonic: "mnemonic/def")

   shape  "( mnemonic/def {ref}:Ref {mnemonic}:String {doc}:Expr
      {value} )"

   This suggests {mnemonic} as the mnemonic of the name designated by
   {ref} in its namespace. If {value} is of type Expr, this defines the
   reference {ref} to {value} in the outer scope of this form.

   {doc} is any expression that provides documentation for this
   reference.
   If it has type Array, it MUST be a string. It could be any kind of
   metadata or document type.

3.1.8.3.  Namespace description

   name  "0xB" (mnemonic: "ns-mnemonic")

   shape  "( ns-mnemonic {ns}:Expr {mnemonic}:String {doc} )"

   This suggests {mnemonic} as the mnemonic of the namespace designated
   by {ns} (which can be the integer to which this namespace is
   associated, a reference in this namespace or the unique identifier
   of this namespace).

3.1.8.4.  Verifiable namespace definition

   name  "0xC" (mnemonic: "verifiable-ns")

   shape  "( verifiable-ns {marker}:Int {id}:UniqueID {mnemonic}:Expr
      {definitions} )"

   This associates the namespace identified by {id} to the value
   {marker}, within the outer and inner scopes of this form. Verifiable
   namespaces are immutable: {id} MUST be verifiable against the byte
   sequence "{mnemonic} {definitions}". Defining a reference in the
   inner scope of this form also defines that reference in the outer
   scope of this form.

   For this verification to be meaningful, {definitions} MUST NOT
   contain any reference from a namespace before it is associated in
   {definitions}.

   If {mnemonic} is of type String, then this suggests it as the
   mnemonic of the namespace. Otherwise it MUST be "nil".

3.1.8.5.  Array concatenation

   name  "0x10" (mnemonic: "concat")

   shape  "( concat {array1} {array2} )"

   Name's type  EagerFunction

   Form's type  Array

   Form's value  the concatenation of {array1} and {array2}.

3.1.8.6.  Substitution

3.1.8.6.1.  Substitution function

   name  "0x11" (mnemonic: "subst")

   shape  "( subst {code} )"

   Name's type  LazyFunction

   Form's type  EagerFunction

   Form's value  A substitution function whose return value is the
      value of {code}. Within {code}'s specific yield, the names "arg"
      and "rest" are defined:

3.1.8.6.2.  Argument

   name  "0x12" (mnemonic: "arg")

   shape  "( arg {n}:Int )"

   Name's type  EagerFunction

   Form's type  Expr

   Form's value  the element number {n} (starting at zero) of the
      substitution function's arguments list

3.1.8.6.3.  Rest of arguments list

   name  "0x13" (mnemonic: "rest")

   shape  "( rest {n}:Int )"

   Name's type  EagerFunction

   Form's type  Expr

   Form's value  the substitution function's arguments list without its
      first {n} elements.

3.1.8.6.3.1.  Examples

   Here is a definition of the inverse followed by the numbers 1/2, 1/3
   and 1/4:

   "( define inverse ( subst ( frac 1 ( arg 0 ) ) ) ) ( inverse 2 ) (
   inverse 3 ) ( inverse 4 )"

   Substitution will splice multiple expressions in place:

   The evaluation of "( ( subst 1 ( rest 0 ) 2 ) 3 4 )" must yield the
   same as "( 1 3 4 2 )"

3.1.9.  Arithmetic

   In the text notation of a BULK stream, a decimal integer represents
   the smallest byte sequence that denotes this integer with atoms and
   forms from this specification. For example, "( 31 256 )" is a
   notation for the bytes "0x1 0x4 0x1F 0x5 0x1 0x0 0x2".

3.1.9.1.  Fraction

   name  "0x20" (mnemonic: "frac")

   shape  "( frac {num}:Int {div}:Int )"

   This is the number {num}/{div}.

   Type: "Number".

3.1.9.2.  Arbitrary precision signed integer

   name  "0x21" (mnemonic: "bigint")

   shape  "( bigint {bits}:Bytes )"

   The bits contained in {bits} are the value of this integer in
   two's-complement notation.

   Types: "Number", "Int".

3.1.9.3.  Binary floating-point number

   name  "0x22" (mnemonic: "binary")

   shape  "( binary {bits}:Bytes )"

   This is a floating-point number expressed in IEEE 754-2008 binary
   interchange format. If {bits} is an Array, the size of its contents
   must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits}
   MUST NOT have type Word8.

   Types: "Number", "Float".

3.1.9.4.  Decimal floating-point number

   name  "0x23" (mnemonic: "decimal")

   shape  "( decimal {bits}:Bytes )"

   This is a floating-point number expressed in IEEE 754-2008 decimal
   interchange format. If {bits} is an Array, the size of its contents
   must be a multiple of 32 bits, as per IEEE 754-2008 rules. {bits}
   MUST NOT have type Word8.

   Types: "Number", "Float".

3.1.10.  Compact formats

   This specification and other specifications in the official BULK
   suite take the option to use as their basic building block a form
   with a distinguishing reference as first element (basically, they
   are a binary representation of an abstract syntax tree). As noted
   previously, this means that most representations weigh 4 bytes plus
   their actual content, which will in turn have some overhead because
   of one or several marker bytes.

   But when there is a special need for compactness, BULK makes it
   possible to design protocols and formats with different trade-offs,
   while retaining its property of being parseable by processing
   applications not knowing the protocol in its entirety.

   On one end of the spectrum, a format might choose to use an array to
   encapsulate an ad hoc binary format. An extreme use of this scheme
   would be to use BULK just to make explicit the binary format used.
   With a known profile (for example with a file extension and/or media
   type for such explicitly typed BLOBs), a BULK stream that consists
   solely of the version form, a reference that describes the binary
   format and an array will have a total overhead of 14, 16 or 20 bytes
   if the data's size is representable in 16, 32 or 64 bits,
   respectively.

   Still, even this extreme in the design space retains the ability to
   insert expressions in the BULK stream, whatever their type. Thus
   metadata can be added about data that is represented in a format
   that doesn't allow for metadata or allows only limited metadata.
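
   As a non-normative check, the 14, 16 and 20 byte figures quoted
   above can be re-derived with a few lines of Python. This is only a
   sketch: the names "VERSION_FORM" and "blob_overhead" are
   hypothetical, and it assumes (from the array definition given
   earlier in this specification, which is not repeated here) that an
   array is written as one marker byte, a size expressed as an unsigned
   word, and then the content.

```python
# Sizes stated in this specification: a marker is one byte, a reference is
# two bytes (namespace byte + name byte), and an unsigned word is one marker
# byte followed by its payload.
MARKER = 1
REFERENCE = 2


def word_size(payload_bytes):
    """Size in bytes of an unsigned word with the given payload size."""
    return MARKER + payload_bytes


# "( bulk:version 1 0 )": opening marker, reference, two 8-bit words,
# closing marker.
VERSION_FORM = MARKER + REFERENCE + word_size(1) + word_size(1) + MARKER


def blob_overhead(size_field_bytes):
    """Overhead of: version form + describing reference + array header.

    Assumption: the array header is one marker byte plus the size word.
    """
    array_header = MARKER + word_size(size_field_bytes)
    return VERSION_FORM + REFERENCE + array_header


for bits in (16, 32, 64):
    print(bits, "bit size:", blob_overhead(bits // 8), "bytes of overhead")
```

   The version form accounts for 8 of those bytes in every case, which
   is why exchanging the version out-of-band matters for very small
   messages.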

   In between these two extremes, of compactness or uniformity, several
   options are available to produce a format that leverages the BULK
   parser a lot more than using a single array while being more compact
   than a classical BULK format. The following forms provide a standard
   way to create such formats.

   A flat sequence of operators and operands is called a BULK bytecode.
   Prefix bytecodes are those where operators come before operands,
   postfix bytecodes are those where operators come after operands. In
   the following forms, operators MUST be references (as usual with
   BULK, another namespace could define other bytecode forms with
   different rules).

   The default semantics of a bytecode form is the result of
   transforming its abstract yield into a sequence of forms that have
   the usual semantics of BULK forms whose first expression is of type
   "Function". When evaluating a bytecode form that doesn't provide
   arities, a processing application MUST abort this transformation as
   soon as it encounters a reference for which it cannot determine
   whether it is an operator or an operand, or an operator of unknown
   arity. When evaluating a bytecode form that provides arities, any
   reference that is not known to be an operator MUST be determined not
   to be an operator.

   To transform a prefix bytecode abstract yield, a processing
   application creates an alternate context. If the first expression of
   the bytecode can be determined not to be an operator, it is removed
   from the beginning of the bytecode and appended as an atom at the
   end of the alternate context. If the first expression of the
   bytecode can be determined to be an operator, it is removed from the
   beginning of the bytecode along with as many next expressions as its
   arity and they all are appended as a form in the alternate context.
   The transformation continues until the bytecode is empty, in which
   case the alternate context becomes the inner context of the bytecode
   form and the transformation is complete.

   To transform a postfix bytecode form, a processing application
   creates an alternate context. If the first expression of the
   bytecode can be determined not to be an operator, it is removed from
   the beginning of the bytecode and appended as an atom at the end of
   the alternate context. If the first expression of the bytecode can
   be determined to be an operator, it is removed from the beginning of
   the bytecode and as many expressions as its arity are removed from
   the end of the alternate context. They all are appended as a form in
   the alternate context (with the operator as first element followed
   by the operands, kept in their previous order). The transformation
   continues until the bytecode is empty, in which case the alternate
   context becomes the inner context of the bytecode form and the
   transformation is complete.

   If the overhead of several marker bytes in the operands of some
   operators is too much, even more compactness can be achieved by
   packing together small operands. For example, instead of an operator
   with two integers as its operands, one could specify an operator to
   take a single word as operand and extract the integers from it
   (while still retaining the ability to operate on many sizes of
   integers, because it can still deduce the size of the integers by
   dividing the size of the word by two).
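
   As a non-normative illustration, the prefix and postfix
   transformations described above can be sketched in Python. This is a
   minimal sketch under stated assumptions, not a reference
   implementation: it only models bytecode forms that provide arities
   (so any reference missing from the arity table is treated as an
   operand), it represents a reference by its mnemonic in a
   hypothetical "Ref" class and a form by a Python list, and the
   references "sgf:white", "math:add" and "math:mul" are invented for
   the example.

```python
class Ref(str):
    """A BULK reference, represented here by its mnemonic (assumption)."""


def arity_of(expr, arities):
    """Return the arity of expr if it is a known operator, else None."""
    if isinstance(expr, Ref):
        return arities.get(expr)
    return None


def transform_prefix(bytecode, arities):
    """Operators come first and take the next `arity` expressions."""
    pending = list(bytecode)
    context = []  # the "alternate context" of the specification
    while pending:
        head = pending.pop(0)
        arity = arity_of(head, arities)
        if arity is None:
            context.append(head)  # operand: appended as an atom
            continue
        if len(pending) < arity:
            raise ValueError(f"not enough operands for {head}")
        operands, pending = pending[:arity], pending[arity:]
        context.append([head, *operands])  # appended as a form
    return context


def transform_postfix(bytecode, arities):
    """Operators come last and take operands from the end of the context."""
    context = []
    for expr in bytecode:
        arity = arity_of(expr, arities)
        if arity is None:
            context.append(expr)
            continue
        if len(context) < arity:
            raise ValueError(f"not enough operands for {expr}")
        split = len(context) - arity
        operands = context[split:]  # kept in their previous order
        del context[split:]
        context.append([expr, *operands])
    return context


# Two moves in prefix order, then a nested computation in postfix order.
black, white = Ref("sgf:black"), Ref("sgf:white")
moves = transform_prefix([black, 4, 16, white, 3, 3], {black: 2, white: 2})

add, mul = Ref("math:add"), Ref("math:mul")
product = transform_postfix([2, 3, add, 4, mul], {add: 2, mul: 2})
```

   Here "moves" holds two forms, one per move, and "product" holds a
   single form in which the "math:add" form is the first operand of
   "math:mul". Only the postfix variant can build nested forms from a
   flat sequence, because its operands are taken from the already
   transformed context.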

   As an example of these trade-offs, a BULK format representing player
   moves with a pair of coordinates might represent a single move with
   the following shapes:

   classical (8 bytes)  "( sgf:black/2 w8 0x04 w8 0x10 )"

   packed classical (7 bytes)  "( sgf:black/1 w16 0x04 0x10 )"

   bytecode (6 bytes)  "sgf:black/2 w8 0x04 w8 0x10"

   packed bytecode (5 bytes)  "sgf:black/1 w16 0x04 0x10"

   The transformation defined for the bytecode forms makes it possible
   to mix literal expressions and operations represented by a sequence
   of operators and operands. In the previous scenario, for example,
   one might represent alternating moves by two players as a sequence
   of words, lowering the weight of each move to 3 bytes when
   coordinates are below 256. The difference between all these schemes
   and an array is that you keep the ability to insert other forms, for
   example to represent comments on the game or variants.

   The cost of the bytecode format is that if it contains operators
   whose arity is unknown to a processing application, the whole
   sequence after the first such operator is unreadable to that
   processing application, whereas in the classical format, the
   processing application can still process all the forms it
   understands (and it requires no anticipation by the application
   creating the BULK stream).

3.1.10.1.  Prefix bytecode

   name  "0x30" (mnemonic: "prefix-bytecode")

   shape  "( prefix-bytecode {bytecode} )"

   This is a prefix bytecode form that doesn't provide arities.

3.1.10.2.  Prefix bytecode with arities

   name  "0x31" (mnemonic: "prefix-bytecode*")

   shape  "( prefix-bytecode* ( {arities} ) {bytecode} )"

   This is a prefix bytecode form that provides arities.

   {arities} MUST be a sequence of shapes "( {arity}:Int {refs} )".
   {refs} MUST be a sequence of references. It indicates that all
   references in this sequence are operators of arity {arity}.

3.1.10.3.  Postfix bytecode

   name  "0x32" (mnemonic: "postfix-bytecode")

   shape  "( postfix-bytecode {bytecode} )"

   This is a postfix bytecode form that doesn't provide arities.

3.1.10.4.  Postfix bytecode with arities

   name  "0x33" (mnemonic: "postfix-bytecode*")

   shape  "( postfix-bytecode* ( {arities} ) {bytecode} )"

   This is a postfix bytecode form that provides arities.

   {arities} MUST be a sequence of shapes "( {arity}:Int {refs} )".
   {refs} MUST be a sequence of references. It indicates that all
   references in this sequence are operators of arity {arity}.

3.1.10.5.  Arity declaration

   name  "0x34" (mnemonic: "arity")

   shape  "( arity {arity}:Int {refs} )"

   {refs} MUST be a sequence of references. It indicates that all
   references in this sequence are operators of arity {arity}.

3.1.10.6.  Property list

   name  "0x35" (mnemonic: "property-list")

   shape  "( property-list {bytecode} )"

   {bytecode} MUST be a sequence of expressions in which the first and
   every odd-numbered expression is a reference that will be taken as
   having arity 1.

   The semantics of "( property-list foo:bar ( frac 2 3 ) foo:baz true
   foo:quux "abc" )" SHOULD be the same as that of "( foo:bar ( frac 2
   3 ) ) ( foo:baz true ) ( foo:quux "abc" )".

4.  Extension namespaces

   Extension namespaces are defined with a unique identifier, to be
   associated to a marker value.

   Because of BULK's decentralized nature, as far as a processing
   application is concerned, apart from standard namespaces, there is
   no difference between a namespace defined as part of the official
   BULK suite and a user-defined one.

5.  Profiles

   A profile is a byte sequence parsed by a processing application just
   after the "version" form or before the first expression if there is
   no "version" form. Thus a parser SHOULD look ahead at the beginning
   of a stream to see if the first three bytes are "( bulk:version".
   With respect to the BULK stream, the profile is out-of-band
   information, usually implicit.

   A processing application doesn't need to include the profile in the
   concrete yield, as long as the semantics of the abstract yield are
   maintained.

   The same BULK stream might be processed with different profiles.

   A processing application MUST NOT deduce the profile from the
   content of a BULK stream.

5.1.  Profile redundancy

   A processing application SHOULD only rely on the use of a profile
   when it is a safe assumption that the profile is known, for example
   within a communication where the protocol dictates the profile.

   In particular, long-term storage of a BULK stream SHOULD preserve
   profile information, for example with a media type that dictates the
   profile.

   Otherwise, an application writing a BULK stream to long-term storage
   SHOULD include the profile after the version form. For this reason,
   the expressions in a profile SHOULD have idempotent semantics.

5.2.  Standard profile

   This specification defines the default profile that a processing
   application MUST use when it is not using a specific profile:

   "( bulk:stringenc ( bulk:iana-charset 106 ) )"

   This means that the default string encoding in a BULK stream is
   UTF-8.

6.  Security Considerations

6.1.  Parsing

   Parsing a BULK stream is designed to be free of side effects for the
   processing application, apart from storing the parsed results.

   Arrays in BULK carry their size, so that the application knows in
   advance the size of the data to read and store, thus making it
   easier to build robust code. Malicious software, however, may
   announce an array with a size chosen to get an application to
   exhaust its available memory. When a BULK stream has been completely
   received, an array bigger than the remaining data SHOULD trigger an
   error.
   When a BULK stream's size is not known in advance, the application
   SHOULD use a growable data structure.

6.2.  Forwarding

   When a processing application forwards all or part of the data in a
   BULK stream to another application, care must be taken if part of
   the forwarded data was not entirely recognized, as it could be used
   by an attacker to benefit from the authority the forwarding
   application has on the recipient of the data.

6.3.  Definitions

   The architecture of a processing application SHOULD ensure that a
   malicious agent cannot abuse authority given to it to define a
   namespace in order to modify associations in other namespaces.
   Depending on the use of data structures storing BULK expressions,
   this could amount to giving an attacker a way to manipulate the
   application's state. See Appendix A for an example of architecture
   that is resistant to that kind of attack.

7.  IANA Considerations

   This specification defines a new media type, application/bulk. Here
   is the information for its registration with IANA:

   Type name  application

   Subtype name  bulk

   Required parameters  none

   Optional parameters  none

   Encoding considerations  none, content is self-describing

   Security considerations  cf. Section 6

   Interoperability considerations  the constraint to start any BULK
      stream with a version form has the side effect that classes of
      BULK streams can be identified by a sequence of bytes acting as a
      "magic number":

      0x012000  any BULK stream

      0x01200004  a BULK stream of any major version beneath 256

      0x0120000401  a BULK stream of major version 1

      0x0120000401040002  a BULK stream of version 1.0

   Published specification  this document

   Applications that use this media type  none so far

   Fragment identifier considerations  this specification defines no
      semantics for addressing the data with a fragment identifier; a
      future specification MAY define fragment identifier syntaxes to
      address the content by byte offset or the parsed results by their
      position in the yielded sequence

   Additional information  a future specification MAY define a naming
      convention for media types based on bulk with a +bulk suffix, as
      for XML with +xml

8.  Acknowledgements

   The original author of this specification read Erik Naggum's famous
   rant about XML [1] several years before, and while forgotten as
   such, it clearly was the seed that slowly bloomed into the design of
   BULK. This format is dedicated to Erik.

9.  References

9.1.  Normative References

   [IANA-Charsets]
              "IANA Charset Registry (archived at):".

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative references

   [Avro]     Cutting, D., "Apache Avro[TM] 1.7.4 Specification",
              February 2013.

   [HTTP2]    Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol version 2 (HTTP/2)", RFC 7540, May 2015.

   [protobuf]
              "Protocol Buffers", July 2008.

   [Smile]    Saloranta, T., "Smile Data Format", September 2010.

   [Thrift]   Slee, M., Agarwal, A., and M. Kwiatkowski, "Thrift:
              Scalable Cross-Language Services Implementation", April
              2007.

9.3.  URIs

   [1]  http://www.schnada.de/grapt/eriknaggum-xmlrant.html

Appendix A.  Robust namespace definition

   This constitutes a suggestion of architecture for a BULK processing
   application. It has the advantage that an agent cannot modify the
   values of names to which it has not specifically been given
   authority. This architecture doesn't ensure this property by
   checking the validity of definitions but by adhering to the
   Principle Of Least Authority, thus ensuring no false positives or
   TOCTOU race conditions.

   For each new context (including the abstract yield when parsing
   starts), the parser creates a new copy of each known namespace.
   These copies are available in this context to retrieve and define
   values. This implements the lexical scoping of definitions on top of
   providing the robustness properties discussed here.

   By default, all namespaces created in a context are discarded at the
   end of this context.

   Of course, an implementation of the architecture presented here can
   be optimized compared to the abstract algorithm, for example by
   using copy-on-demand.

   Any namespace that is not a copy for its context but the object
   retained by the application afterwards gives authority to make
   long-lasting definitions. Such a namespace is called lasting here.

A.1.  Selective authority

   A number of lasting namespaces are included for the abstract yield.
   Their unique identifiers are agreed out-of-band. The disadvantage of
   this solution is that it needs prior agreement on the definable
   namespaces.

A.2.  Open authority

   Any "ns" form for a unique identifier unknown to the processing
   application triggers the creation of a lasting namespace.

   The disadvantage of this solution is that it opens a denial of
   service vulnerability.
   If Bob is a processing application and Carol and Dave are agents
   communicating with Bob with an open authority, Dave can prevent
   Carol from defining a namespace if he manages to learn the unique
   identifier and to start a communication with Bob before Carol does.

   If an agent uses a secure way to create unique identifiers, this
   solution is both flexible and safe (the burden is not on the BULK
   processing application).

Appendix B.  Verifiable namespace bootstrap

   If a processing application that implements one or several hashing
   algorithms encounters a BULK stream with namespaces identified by
   UniqueID forms defined in an unknown namespace, it would be possible
   for the application to recover that namespace's definition and still
   verify it, as shown in the following process.

   The processing application reads a BULK stream starting with "(
   bulk:version 1 0 ) ( ns w8 0x28 ( 0x28 0xC w32 0xFD 0x2A 0x34 0x02 )
   ( ns w8 0x29 ( 0x28 0xC w32 0x24 0xA3 0x58 0xF3 )". This means that
   the namespace identified by FD2A3402 is associated with marker 40,
   and a form from that namespace is used to identify itself. A second
   namespace, associated with marker 41, is identified by 24A358F3 with
   the same form taken from the previous namespace.

   By whatever mechanism is available to acquire BULK namespaces'
   definitions (which could be reading local configuration files or
   making a search on the Internet), the processing application gets
   the following definition for the namespace identified by FD2A3402:
   "( bulk:version 1 0 ) ( bulk:verifiable-ns w8 0xF0 ( 0xF0 0xC w32
   0xFD 0x2A 0x34 0x02 ) "crypto" ( bulk:mnemonic/def 0xF0 0xC "md9" )
   )". It can now try every hashing algorithm known to it and check
   which one hashes the byte sequence ""crypto" ( bulk:mnemonic/def
   0xF0 0xC "md9" )" into FD2A3402.
   If it finds one, from now on, the processing application has
   verified this namespace and can verify any other use of that
   crypto:md9 reference.

Author's Address

   Pierre Thierry
   Thierry Technologies

   EMail: pierre@nothos.net