INTERNET-DRAFT                                                 S. Stuart
Intended Status: Proposed Standard                                Google
Expires: April 11, 2013                                      R. Fernando
                                                                   Cisco
                                                         October 8, 2012

           Encoding rules and MIME type for Protocol Buffers
                  draft-rfernando-protocol-buffers-00

Abstract

   This document describes the encoding format for Protocol Buffers
   encoded data and registers a MIME type associated with Protocol
   Buffers encoded data.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1 Terminology
   2. Message Structure
   3. Encoding Rules
     3.1 Numbers as VarInts
     3.2 Encoding and Interpretation of Protobuf Messages
     3.3 Wire Types
       3.3.1 Wire Type 0
       3.3.2 Wire Type 1
       3.3.3 Wire Type 2
       3.3.4 Wire Type 5
   4. Embedded Messages
   5. Optional and Repeated Elements
   6. Field Order
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
     10.1 Informative References
   Authors' Addresses

1. Introduction

   Protocol buffers, referred to as protobuf in this document, is a
   commonly used interchange format to serialize structured data for
   storage and transmission between applications and systems. It
   supports simple and composite data types and provides rules to
   serialize those data types into a portable format that is both
   language and platform neutral. Because it encodes data in a binary
   format, encoding and decoding are fast and the serialized form is
   compact. It is also supported by a wide variety of programming
   languages.

   While protocol buffers has gained widespread use, it has so far been
   described only informally and has not been standardized. This
   document specifies the encoding rules for protobuf and registers the
   MIME type 'application/protobuf' for it in accordance with RFC 2048.

   This document borrows heavily from the web page [GPBENC].

1.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2. Message Structure

   Protobuf defines all data elements in discrete units called
   "messages" [GPBOVW]. A message is a logical collection of related
   data items. It is similar to a "record" or a "structure" in a
   traditional programming language. Many standard simple data types are
   available as field types, including bool, int32, float, double and
   string. One can also add further structure to the outer message by
   using enums and other messages as field types.

   The following is an example of a message definition in protobuf:

      message Person {

        enum PhoneType {
          MOBILE = 0;
          HOME = 1;
          WORK = 2;
        }

        required string name = 1;
        required int32 id = 2;
        optional string email = 3;
        message PhoneNumber {
          required string number = 1;
          optional PhoneType type = 2;
        }

        repeated PhoneNumber phone = 4;
      }

   Note the presence of simple data types such as strings and int32s as
   well as complex data types such as enums and messages in the above
   message definition.

   Each field is annotated with one of the following three modifiers:

   1. required: a value for the field must be provided; otherwise the
      message is considered malformed and the decoding entity will
      throw an exception.

   2. optional: the field may or may not be set. If an optional field
      value isn't set, a default value is used.

   3. repeated: the field may be repeated any number of times (including
      zero). The order of the repeated values is preserved in the
      protocol buffer encoding.

   The integer token to the right of the assignment operator is a field
   number.
   These field numbers uniquely identify a field in a message and,
   together with the wire type, are used to form the key for the
   key-value pairs in the serialized data stream. Keys for field
   numbers 1 through 15 fit in a single byte, whereas field numbers 16
   through 2047 need two bytes, so as an optimization one can reserve
   field numbers 1 through 15 for the commonly used or repeated
   elements. Each element in a repeated field requires re-encoding the
   field number, so repeated fields are particularly good candidates
   for this optimization.

   This document will not describe every syntactic element of the
   protobuf language but will restrict discussion to only those elements
   that are relevant to the encoding and decoding of data types.

3. Encoding Rules

   This section describes the encoding rules for the different field
   types.

3.1 Numbers as VarInts

   To understand protobuf encoding, we need to first understand
   VarInts.

   Integers in protobuf, including field keys and lengths, are
   represented as base 128 variable-length integers (VarInts) unless a
   fixed-length wire type is used (see section 3.3). VarInt is an
   encoding scheme that uses only as many bytes as are necessary to
   represent a number, and it can be used to encode arbitrarily large
   numbers. It achieves this by using a continuation bit in every byte.
   Each byte in a VarInt, except the last byte, has the most significant
   bit (msb) set, indicating that there are more bytes to come. The last
   byte has the msb set to zero. The 7-bit groups that remain after the
   msb has been removed from each byte are stored least significant
   group first, so they are reversed and concatenated to produce one
   single binary representation of the number. For example, the bytes
   '96 01' carry the 7-bit groups 0010110 and 0000001; reversed and
   concatenated they give 0000001 0010110, which is 150.

3.2 Encoding and Interpretation of Protobuf Messages

   Protobuf messages are not self-describing. In other words, the entity
   decoding the binary representation of the message needs to refer to
   the equivalent text definition of the message to interpret the
   fields.
   The "tag" associated with a field (the field number that follows
   the "=" sign in the text definition) tells the decoder which field
   it is currently looking at.

   A wire type is also included with every field. The wire type tells
   the decoder how the value is delimited, so the decoder can skip a
   field without interpreting it. This preserves backward compatibility
   when the decoder is not aware of a particular field's tag value.

   Every field is encoded as a (key, value) pair. The key is a VarInt
   with the value ((field-tag << 3) | wire-type). In other words, the
   low-order three bits of the key are the wire type.

3.3 Wire Types

   This document defines the following wire types, their interpretation
   and the data types that they are used for.

3.3.1 Wire Type 0

   If the wire type is 0, the value field is simply a VarInt. This
   encoding is used to represent int32, int64, uint32, uint64, sint32,
   sint64, bool and enum. For positive integers the interpretation of
   the VarInt is straightforward, as explained in section 3.1.

   For example, consider the following message:

      message Test1 {
        required int32 a = 1;
      }

   If the field 'a' is set to 150, the message is serialized as
   '08 96 01'. The key byte 08 is (1 << 3) | 0, and '96 01' is the
   VarInt encoding of 150.

   If int32 and int64 are used for encoding negative integers, the
   resulting VarInt is always a ten byte quantity (effectively treating
   it as a large unsigned integer). If a signed type (sint32 or sint64)
   is used, a zigzag encoding scheme is used which assigns small VarInt
   values to small negative numbers. In this scheme, the numbers -2, -1,
   0, 1, 2 would be represented as VarInts 3, 1, 0, 2, 4 and so on.
   Mathematically, each value 'n' is encoded using (n << 1) ^ (n >> 31)
   for sint32 or (n << 1) ^ (n >> 63) for sint64, where >> is an
   arithmetic (sign-extending) shift.

3.3.2 Wire Type 1

   This is a fixed length 64-bit quantity. This wire type is used to
   represent fixed64, sfixed64 and double data types. The value is
   stored in little-endian format.
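
   The rules of sections 3.1 through 3.3.2 can be sketched in Python as
   follows. This is an illustrative, non-normative sketch under stated
   assumptions, not a reference implementation: the function names
   (encode_varint, encode_key, zigzag32, encode_int32_field,
   encode_fixed64_field) are hypothetical and are not taken from
   [GPBENC].

```python
import struct


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base 128 VarInt (section 3.1)."""
    if n < 0:
        raise ValueError("VarInt input must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # least significant 7-bit group comes first
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow: set the msb
        else:
            out.append(low7)         # last byte: msb is zero
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """Build the key VarInt ((field-tag << 3) | wire-type) of section 3.2."""
    return encode_varint((field_number << 3) | wire_type)


def zigzag32(n: int) -> int:
    """Zigzag mapping for sint32: (n << 1) ^ (n >> 31).

    Python's >> on a negative int is already an arithmetic shift; the
    mask keeps the result within 32 bits.
    """
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Encode an int32 field with wire type 0.

    A negative value is treated as a 64-bit unsigned quantity, which is
    why its VarInt always occupies ten bytes.
    """
    return encode_key(field_number, 0) + encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_fixed64_field(field_number: int, value: int) -> bytes:
    """Encode a fixed64 field with wire type 1 (8 bytes, little-endian)."""
    return encode_key(field_number, 1) + struct.pack("<Q", value)


# Test1 from section 3.3.1 with a = 150
print(encode_int32_field(1, 150).hex())          # 089601
print([zigzag32(n) for n in (-2, -1, 0, 1, 2)])  # [3, 1, 0, 2, 4]
```

   Applied to the Test1 message with 'a' set to 150, the sketch
   reproduces the bytes '08 96 01' given in section 3.3.1.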

3.3.3 Wire Type 2

   This is a length delimited stream of bytes. The value field is a
   VarInt encoded length followed by the specified number of bytes of
   data. This wire type is used to represent string and bytes data
   types, embedded messages (section 4) and packed repeated fields
   (section 5).

   As an example, consider the following message:

      message Test2 {
        required string b = 2;
      }

   If the string 'b' is set to "hello world", the message is serialized
   as '12 0b 68 65 6c 6c 6f 20 77 6f 72 6c 64'.

3.3.4 Wire Type 5

   This is a fixed length 32-bit quantity. This wire type is used to
   represent fixed32, sfixed32 and float data types. The value is stored
   in little-endian format.

4. Embedded Messages

   Embedded messages are encoded as follows. The inner (or the embedded)
   message is serialized first using the rules described above. The
   resultant byte stream is then treated as a Wire Type 2 field in the
   outer message and added to its encoding.

   Consider the example:

      message Test1 {
        required int32 foo = 1;
      }

      message Test2 {
        required Test1 c = 3;
      }

   If the field 'foo' were to take the value 150, the resultant encoded
   byte stream for the inner message would be '08 96 01', and the
   encoded byte stream for Test2 would be '1a 03 08 96 01'.

5. Optional and Repeated Elements

   If the message definition has 'repeated' elements, then the encoded
   message has zero or more key-value pairs with the same field number.
   These repeated values do not have to appear consecutively; they may
   be interleaved with other fields.

   If the message definition has 'optional' elements, then the encoded
   message may or may not have a key-value pair with that field number.

   A repeated field could be a 'packed repeated field', in which case
   the encoding for the field is slightly different. A packed repeated
   field containing zero elements does not appear in the encoded
   message. Otherwise, all of the elements of the field are packed into
   a single key-value pair with the wire type 2 (length delimited).
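
   The length delimited encoding used for strings, embedded messages
   and packed repeated fields can be sketched in Python as follows.
   This is an illustrative, non-normative sketch: the function names
   are hypothetical, and the packed example (field number 4 holding the
   values 3, 270 and 86942) is an assumption chosen for illustration
   and is not part of the message definitions in this document.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base 128 VarInt (section 3.1)."""
    if n < 0:
        raise ValueError("VarInt input must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)         # final byte
            return bytes(out)


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Wire type 2: key, then VarInt length, then the payload bytes."""
    key = encode_varint((field_number << 3) | 2)
    return key + encode_varint(len(payload)) + payload


def encode_packed_varints(field_number: int, values) -> bytes:
    """Packed repeated field of non-negative VarInt elements.

    Elements are concatenated without individual keys; a field with
    zero elements contributes nothing to the encoded message.
    """
    values = list(values)
    if not values:
        return b""
    payload = b"".join(encode_varint(v) for v in values)
    return encode_length_delimited(field_number, payload)


# Test2 from section 3.3.3 with b = "hello world"
print(encode_length_delimited(2, "hello world".encode("utf-8")).hex())
# 120b68656c6c6f20776f726c64

# Test2 from section 4: the inner Test1 message (foo = 150) is
# serialized first and then embedded as field 3.
inner = bytes.fromhex("089601")
print(encode_length_delimited(3, inner).hex())  # 1a03089601

# Hypothetical packed field number 4 holding 3, 270 and 86942
print(encode_packed_varints(4, [3, 270, 86942]).hex())  # 2206038e029ea705
```

   The first two results match the byte streams given in sections 3.3.3
   and 4. In the packed case the single key '22' and the length '06'
   are followed by the three VarInts back to back.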
   Each element of a packed repeated field is encoded the same way it
   would be normally, except without a field number preceding it.

6. Field Order

   When a message is serialized, its known fields should be written
   sequentially by field number. This allows parsing code to use
   optimizations that rely on field numbers being in sequence. However,
   protocol buffer parsers must be able to parse fields in any order, as
   not all messages are created by simply serializing an object. For
   instance, it is sometimes useful to merge two messages by simply
   concatenating them.

7. IANA Considerations

   The MIME media type for protobuf messages is application/protobuf.

   Type name: application

   Subtype name: protobuf

   Required parameters: n/a

   Optional parameters: n/a

   Encoding considerations: 8 bit binary, UTF-8

   Security considerations:
      Generally there are security issues with serialization formats
      if code is transmitted and executed on the decoder end. Since
      protobuf binary encoding does not carry code, we consider that
      the encoding scheme itself does not introduce any security risks.

8. Security Considerations

   See section 7.

9. Acknowledgements

   We thank the engineers at Google for giving us the protocol buffers
   serialization format. All the concepts described in this document
   come from web pages [GPBENC, GPBOVW] defining protocol buffer
   mechanisms. This document is merely an attempt to standardize those
   mechanisms in the IETF and assign a MIME type for protobuf encoded
   messages.

10. References

10.1 Informative References

   [GPBENC] Google Protocol Buffer Encoding,
            https://developers.google.com/protocol-buffers/docs/encoding

   [GPBOVW] Google Protocol Buffer Overview,
            https://developers.google.com/protocol-buffers/docs/overview

Authors' Addresses

   Stephen Stuart
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   USA

   EMail: sstuart@google.com

   Rex Fernando
   Cisco Systems
   170 W. Tasman Dr.
   San Jose, CA 95134

   EMail: rex@cisco.com