INTERNET DRAFT Colin Low, Jim Randell, Mike Wray Hewlett Packard Laboratories October 1997 Self-Describing Data Representation (SDR) Status of this Document This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." To view the entire list of current Internet-Drafts, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast). Distribution of this document is unlimited. This document expires on April 23 1998 Abstract This document describes a human-readable, textual syntax for representing self-describing structured data. This representation was designed as a transfer syntax for loosely-coupled distributed applications where one cannot depend on sender(s) and receiver(s) sharing a schema for exchanged data. The syntax is compact, expressive, intuitive, and simple to implement. 1. Introduction A traditional assumption in distributed systems is that structured data exchanged between applications is marshalled into an efficient [Page 1] INTERNET DRAFT Self-Describing Data Representation October 1997 binary representation, and the receiver shares enough state with the sender to unmarshal this representation. This is usually achieved through a shared schema, in the form of shared header files, or an interface description language (IDL) used to generate the marshaling and unmarshaling routines for both sender and receiver. This assumption is being eroded by the growth of distributed applications on the Internet, where one can no longer make the assumption that sender and receiver have shared state at some time in the past. In some applications sender and receiver are intentionally anonymous [1] and so data must of necessity be free-standing and self-describing. Application-specific syntaxes which employ self-description do exist (e.g. Mail, News [2], HTTP [3]), but they lack the expressive power required for more general use. The syntax described here was designed to meet the following goals: Self-describing. We do not assume the recipient of data is familiar with its schema, and so data values must include additional information to identify the values, and possibly also the representation used for each data value. Schema Tolerance. In a loosely-coupled and evolving distributed system with millions of participants, it may be difficult or impossible to ensure that all participants use the same schema for interpreting received data. The onus is on the recipient to interrogate the received data to ensure that it contains the minimum information required to make further progress. Expressiveness. The syntax supports structured data values in the form of maps and lists to any level of containment. Human Readability. Free-standing data needs to be (within reasonable limits) self-documenting. Compactness. We wanted a syntax that was efficient to transport and parse. Transport Independence. Self-describing data can (and should) be decoupled from the method used to transport it - it should be able to "ride piggyback" on any nominated transport. As most transports support the transfer of text, this means that the syntax described here can be employed with CORBA IIOP, ASN.1 Encoding Rules, OSF [Page 2] INTERNET DRAFT Self-Describing Data Representation October 1997 DCE, Electronic Mail, TCP, and UDP. Platform and language independence. Accessibility and simplicity. Excessive complexity militates against widespread use. Our goal was that an average programmer should be able to write a parser for the syntax in a couple of days. In order to provide a brief overview and example of the syntax, we have used the syntax to describe information of general interest to users of public transport, in the form of a "bus location update" notification: notification: { !An advanced customer service from the Speedy Bus Corporation type { ... }, content { type (omnibus speedy-bus location-update), bus-id "23", date USDate:"091797", time 24hour:"19:36:50", latitude (59 43 21), longitude (54 23 19), vrml "http://www.bus-company.com/vrml/bus.wrl" }, system { ... } } As a public service the Speedy Bus Corporation has attached GPS receivers to all its buses, and makes location information available about each bus at regular intervals. Someone receiving this information can interpret it and build applications - for example, it would be straightforward to create a map showing the current positions of all the buses, or use regular information about a single bus to estimate the congestion on a known route. Anyone receiving the information above can begin to make sense of it - there is no need to contact the bus company for information about how it encodes its location updates. Much of the information [Page 3] INTERNET DRAFT Self-Describing Data Representation October 1997 represented is self-evident. In more detail, the basis for self-description is the map, denoted by curly brackets {}. A map is an unordered set of name/value pairs. In the example above, the outermost map contains three further maps, named 'type', 'system' and 'content'. Because the SDR syntax is schema-tolerant, we don't need to know anything about the information in the 'type' and 'system' maps - we can effectively ignore them. If someone added a fourth map called 'authentication', it would have no impact on existing applications, which could ignore it. In this application the 'content' map contains application-specific information, such as the position of the bus. We can use as little or as much of the information as we need. Let us suppose that the initial application supplied only the latitude and longitude of the bus. Later, in response to a 3D VRML town plan, the bus company supplies a URL that points to a 3D model of the bus, so that application writers can add the bus dynamically to the town model. The addition of a new 'vrml' element to the 'content' map will not impact older applications, which will continue to use only the position information. This resilience to extensions to the schema is one of the most powerful features of SDR - loosely-coupled applications implemented on a large scale must tolerate this kind of incremental change and enhancement. The rule we apply is that an SDR parser should be able to accept syntactically correct SDR without prior knowledge of a schema, and present it to an application for interpretation. It is up to the application to decide whether the minimum information required is present. For example, let us suppose the Reliable Bus Corporation also provides bus location notifications, but has not implemented the pointer to a VRML model of a bus. An application could work with both the Speedy and the Reliable bus notifications. If the 'vrml' field exists, the application could load the VRML model, otherwise it could substitute a default model. Clearly some minimal knowledge about the schema for notifications is necessary, but an application using SDR is insulated from additions and extensions to a schema. The capability to unmarshal SDR in the absence of a schema leads to new kinds of application capable of carrying out a variety of purely syntactic functions on data-filtering, distribution and rewriting. For example, using the bus notification example above, it is possible to create an application that filters notifications using predicate expressions involving the fields of the notification. This filter [Page 4] INTERNET DRAFT Self-Describing Data Representation October 1997 application needs to know nothing about the application (bus location updates) or the meaning of the fields used in predicates (e.g. latitude and longitude). 2. Data Model SDR provides for the representation of structured data in three ways: map values, list values, and atomic values. 2.1 Map Values A map is an unordered set of name/value pairs. The map { firstname "John", lastname "Doe" } contains two pairs, with braces {} used to delimit the map. The ordering of pairs in a serial representation of a map is undefined - any order of pairs is a valid order. Any atom (byte sequence) can be used to name a value in a map, subject to the restriction that a name may only be associated with a single value in the same map; that is, names are unique within the map. Values may be of any type - maps are not homogeneous - and can consist of further maps or lists as well as Atomic values. Map values provide the foundation for self-describing data, and are intended to be heavily used in applications of SDR. 2.2 List Values List values allow a collection of values to be represented. There is no requirement that the list be homogeneous. The list ( 3 "Foobar" { firstname "John" lastname "Doe" } ) is a list consisting of the atom 3, the atom "Foobar", and the map value from the previous section. Round brackets () delimit the list. Lists can be used to represent sets where the number of elements is not known in advance - for example, a list of peoples' names. Lists can also be used to represent ordered data, such as lists of numbers. Lists should be avoided in cases where an implicit assumption is made [Page 5] INTERNET DRAFT Self-Describing Data Representation October 1997 about the interpretation of list elements - for example, the first element is a name, the second element is an age, the third element is the person's sex, and so on. This interpretation requires implicit state shared between the sender and the receiver. That is, the list ("John Doe" 35 "male") would be better represented by the map { name "John Doe", age 35, sex "male" } which provides a more robust decoupling between sender and receiver - new elements can be added to the map without affecting the receiver. If the receiver only needs to know the age (for a statistical calculation) then it can extract that without having to interpret the rest of the information in the map. This difference between maps and lists illustrates the philosophical difference between conventional serial representations of data, and self-describing data. 2.3 Atomic Values Atomic values are used to represent the data at the leaf nodes of an SDR represenation. An arbitrary sequence of bytes can be represented as an atomic value. Typically any byte sequence represented by an atomic value is treated as an indivisible whole by the system and not subject to further structural breakdown - hence these byte sequences are known as atoms. 2.4 Tags A value may have a tag associated with it. A tag is an atom associated with a value that may be used at the application level to denote the intended high-level type of the value. For example: Person: { firstname "John", lastname "Doe" } indicates a map with the tag 'Person'. Tags are useful in increasing the comprehensibility of the syntax, but more importantly, were [Page 6] INTERNET DRAFT Self-Describing Data Representation October 1997 intended to be used in conjunction with data types. For example, UKDate: "010997" and USDate: "010997" use the same string of numbers to represent different calendar dates (1st September 1997 in former case, 9th January 1997 in the latter). The astute reader will have noticed that shared types (tags) reintroduce the need for shared state between sender and receiver. Tags are optional. When used sparingly they can be used to enrich the intentionally limited range of atomic types in SDR by imposing application semantics on atomic types (e.g. representing a date as a string). In many applications tags are unnecessary - there are common notations in programming languages such as 'C' for representing a small number of different types of value such as integers, booleans, strings, and floating point numbers. SDR has a canonical representation for a small number of value types, and values recognised as such (for example, an integer) are supplied with an implicit tag. Further details on the representation of values and tagging are given below. 3. Data Representation SDR is a human-readable concrete syntax to represent the data values described above as plain text. 3.1 Atoms There is no single best method to represent an atomic value, and so four different methods of representation are provided which cover most common requirements. An atom can be introduced in four ways: as a token, a string, as counted data, and as quoted data. Each of these is a way of introducing a sequence of bytes and any of them can be used to denote an atom. It is normally obvious which representation to use - for example, numbers are normally represented as tokens, and text as strings. 3.1.1 Tokens [Page 7] INTERNET DRAFT Self-Describing Data Representation October 1997 SDR is based on an octet stream. Certain 8-bit values are reserved which correspond to the ASCII (7-bit) values for the SDR meta- characters (see 3.7). If a non-empty atom consists entirely of octets equivalent to the following ASCII characters *or* contains octets whose values are greater than 0x7f (hexadecimal) it can be represented directly: ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789$%&*+-.@?/_^~;<=>[]'`| Atoms expressed in this limited alphabet are known as tokens. This convention is equivalent to support for UTF8-encoded Unicode. Tokens are an efficient and readable representation for: boolean values: true, false integer values: 496 3 -89 floating point values: 1.333 -5.9+e9 names for map values: firstname lastname tags: USDate UKDate 3.1.1.1 Examples Each of the following is a valid atom introduced as a token and representing the given byte sequence (represented in hex): event <-> '65/76/65/6E/74' < <-> '3C' <= <-> '3C/3D' = <-> '3D' x[4] <-> '78/5B/34/5D' 42 <-> '34/32' return-template <-> '72/65/74/75/72/6E/2D/74/65/6D/70/6C/61/74/65' If an atom is not drawn from the limited alphabet it can be represented in one of the three remaining ways: as a string, as quoted data, or as counted data. Each of these ways can introduce arbitrary atoms (i.e. arbitrary byte sequences), whereas the token representation can only be used for a subset of atoms. 3.1.2 Strings An atom can be introduced by representing it as a C-like string constant. The string is delimited with double quotes (" ") and may [Page 8] INTERNET DRAFT Self-Describing Data Representation October 1997 contain octal escape characters as well as the standard escape forms of some other non-printable characters. Atoms represented as strings consist of all bytes between the quote delimiters with escaped characters introduced by the backslash character (\). The following escaped forms are recognised in strings and are translated to the given byte (in hex): \b -> 08 (backspace) \f -> 0C (form feed) \n -> 0A (line feed) \r -> 0D (carriage return) \t -> 09 (horizontal tab) \\ -> 5C (backslash) \" -> 22 (double quote) \' -> 27 (single quote) \x where x is an octal number in the range 0-7 (hex: 00-07) \xx where xx is an octal number in the range 00-77 (hex: 00-3F) \xxx where xxx is an octal number in the range 000-377 (hex: 00-FF) Any other characters following a backslash are considered a parse error. 3.1.2.1 Examples Each of the following is a valid atom introduced as a string and representing the given byte sequence (represented in hex): "string" <-> '73/74/72/69/6E/67' "" <-> '' (empty byte sequence) "forty two" <-> '66/6F/72/74/79/20/74/77/6F' "\"pardon?\"" <-> '22/70/61/72/64/6F/6E/3F/22' "line 1\nline 2" <-> '6C/69/6E/65/20/31/0A/6C/69/6E/65/20/32' 3.1.3 Counted Data A byte sequence may be introduced as an atom by preceding it with #* (hash, star) followed by a byte count represented in ASCII decimal (characters hex: 30-39) followed by \ (backslash) followed by the bytes themselves. This format is particularly useful for programatic generation of data, where you know the length of the data you want to introduce, but you do not want to deal with the special quoting rules of [Page 9] INTERNET DRAFT Self-Describing Data Representation October 1997 strings. 3.1.3.1 Examples Each of the following is a valid atom introduced as counted data and representing the given byte sequence (represented in hex). To improve clarity, input characters are delimited by meta characters [ and ], which are typographic and not part of the Counted Data syntax. [#*10\some bytes] <-> '73/6F/6D/65/20/62/79/74/65/73' [#*0\] <-> '' (empty byte sequence) [#*2\ ] <-> '20/20' [#*9\"pardon?"] <-> '22/70/61/72/64/6F/6E/3F/22' 3.1.4 Quoted Data A byte sequence may be introduced as an atom by placing a delimiting sequence that does not occur within it on either side of the sequence. This representation is denoted by preceding it with #< (hash, less-than) followed by a single byte not occurring in the delimiter, the delimiter itself followed by the same single byte, then the atom itself followed by the single byte delimiter and the delimiter string. 3.1.4.1 Examples Each of the following is a valid atom introduced as quoted data and representing the given byte sequence (represented in hex). #<$END$some bytes$END -> '73/6F/6D/65/20/62/79/74/65/73' #<#x##x -> '' (empty byte sequence) #<*---* *--- -> '20/20' 3.2 Tags 3.2.1 Introduction Tags are a way to describe the intended interpretation of a value in SDR. For example: boolean: true boolean: 1 [Page 10] INTERNET DRAFT Self-Describing Data Representation October 1997 can both denote the same truth value. SDR builds compound values out of atoms, and the atoms are syntactically neutral, with no particular meaning intended. In the absence of implicit interpretations of atoms in SDR one would have to use 'int:32' and 'string:"32"' to indicate that the first atom is meant to be an integer and the second a string. In order to avoid this SDR has `implicit tags'. These implicit tags cause untagged atoms to be treated as `the obvious thing' by SDR systems. The following implicit tags have defined meanings: map, list, atom, string, num, int, float. The implicit tag of '32' is 'int', and the implicit tag of `"32"' is 'string', so we can use these representations without tags when we intend these interpretations. Since the implicit tag of '"32"' is `string' it is not an equivalent representation to '32', though 'int:"32"' is. The rule is that in using an alternative representation for the bytes of an untagged atom an explicit tag must be supplied if the alternative representation has a different implicit tag. It is possible that an SDR system may not want to parse integers and floats completely, but still use alternative representations. This is allowed by the use of the `num' tag. This means that the following atomic value is meant to be a number, but has not necessarily been checked for validity. SDR systems should not rewrite untagged atoms using the 'int' or 'float' tags unless they check validity. For example the following are equivalent forms: 32 int:"32" num:"32" int:#*2\32 int:#<[]'`| plus octet values over 0x7F Ascii = Any ASCII character QuotedData = '#<' data In OctalEscape and in Data types after initial #x, no white-space is permitted. 5. Implementation An implementation of SDR in Java, with source code, is available as a component of the Keryx Notification Service, available at http://keryxsoft.hpl.hp.com 6. Acknowledgements [Page 17] INTERNET DRAFT Self-Describing Data Representation October 1997 This syntax has been critiqued by several people. In particular, Soren Brandt and Anders Kristensen have implemented several variants of the syntax and provided many valuable suggestions. 7. References [1] The Keryx Notification Service - http://keryxsoft.hpl.hp.com [2] Standard for the format of Arpa Internet text messages, RFC 822, 1982 [3] Hypertext Transfer Protocol HTTP/1.0, RFC 1945, 1996 8. Author's Address Colin Low (editor) Hewlett Packard Laboratories, Filton Road, Stoke Gifford, Bristol BS12 6QZ UK Tel: +44 117 9799910 Email: cal@hplb.hpl.hp.com [Page 18]