Re: [XCON] On encodings

As mentioned in the other post, this is interesting analysis, so I thought I'd try to add a bit more detail to it.

I've compared XML encodings that use long tag names with a Megaco-like text encoding (in this case Lumas*) and a binary encoding (Lumas again).

*To justify the choice of Lumas encoding here, the original Lumas was an attempt to formalise the Megaco text encoding so that it could work for both text and binary. So Lumas has a lot in common with Megaco encoding. Indeed, change "context = 123 { ..." to "context = { 123 ..." and you're there!

Note that although Megaco is a text encoding (one of them anyway!), in terms of message length it probably has more in common with a binary encoding than with an XML encoding. This is because in a number of cases Megaco has no tag, or only a short tag.

Tagging is obviously a big issue. To get an idea of the length of typical XML tags I grepped (well Perled really) out the element and attribute names defined in the CPCP draft. When doing this I got an average tag length of 12.7 characters. I noted that many of the names were of the form "a-long-tag-name", so I derived short tag names by taking the first letter of each sub-word. The average of this was 2.08 characters. (I've attached the Perl code used so that you can replicate my analysis if desired.)
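
For anyone who would rather not run Perl, the measurement described above can be sketched in Python roughly as follows. This is not the attached script, and the names in the list are made up for illustration rather than taken from the CPCP draft, so the averages it prints will differ from the 12.7 and 2.08 quoted above.

```python
# Hypothetical names in the "a-long-tag-name" style; illustrative only,
# NOT the element and attribute names from the CPCP draft.
names = [
    "conference-description",
    "display-text",
    "maximum-user-count",
    "subject",
]


def short_name(name: str) -> str:
    """Abbreviate a hyphenated name to the first letter of each sub-word."""
    return "".join(part[0] for part in name.split("-") if part)


def average_length(items) -> float:
    """Mean length, in characters, of a list of strings."""
    return sum(len(s) for s in items) / len(items)


short_names = [short_name(n) for n in names]
print(short_names)
print(average_length(names), average_length(short_names))
```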

As noted in the other mail, XSD does not lend itself to designing XML with small tags. If everything is called X (or similar) the XSD doesn't describe the message very well. So, although you can define an XSD with small tags, I think in practice you wouldn't! (Aside: This is why Lumas allows for specifying a long descriptive name in the message definition, and specifying a short name for tagging.)


For the binary encoding I used Lumas's binary rules. These use variable length tag and length fields, similar to how UTF-8 is variable length. As UTF-8 is commonly part of XML, I can't see that this is an issue.
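
To give a feel for what a variable length tag and length field can look like, here is a minimal Python sketch. It assumes a generic base-128 scheme (7 value bits per byte, top bit set when more bytes follow); this is only an illustration of the idea and is not Lumas's actual binary rules. The function names are invented for the example.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first.

    The top bit of each byte is set when another byte follows.
    Assumed generic scheme for illustration; not the Lumas rules.
    """
    if n < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte
            return bytes(out)


def encode_tlv(tag: int, value: bytes) -> bytes:
    """Tag, then length, then the raw value bytes."""
    return encode_varint(tag) + encode_varint(len(value)) + value


# A small tag and a 2-byte value cost 1 + 1 + 2 = 4 bytes in this scheme.
print(len(encode_tlv(5, (1000).to_bytes(2, "big"))))
```

With small tag numbers and short values, the tag and length each fit in one byte, which is where a 2 byte tag+len overhead comes from in this kind of scheme.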


So here are the results I came up with:

Bool:
XML - Typical name: 5(<></>) + 2*12(tag) + 1(data) = 30
XML - Short name: 5(<></>) + 2 * 2(tag) + 1(data) = 10
Lumas - Text Long Name: 2(note 1) + 12(tag) + 1(data) = 15
Lumas - Text short name: 2(note 1) + 2(tag) + 1(data) = 5
Lumas - Text untagged: 1(note 2) + 1(data) = 2
Lumas binary: 2(tag+len) + 0(data note 3) = 2
Ratio XML Long/Lumas long: 2
Ratio XML long/Lumas short: 6
Ratio XML short/Lumas short: 2
Ratio XML long/binary: 15
Ratio XML short/binary: 5

16 bit int:
XML - Typical name: 5(<></>) + 2*12(tag) + 5(data) = 34
XML - Short name: 5(<></>) + 2 * 2(tag) + 5(data) = 14
Lumas - Text Long Name: 2(note 1) + 12(tag) + 5(data) = 19
Lumas - Text short name: 2(note 1) + 2(tag) + 5(data) = 9
Lumas - Text untagged: 1(note 2) + 5(data) = 6
Lumas binary: 2(tag+len) + 2(data) = 4
Ratio XML Long/Lumas long: 1.79
Ratio XML long/Lumas short: 3.8
Ratio XML short/Lumas short: 1.6
Ratio XML long/binary: 8.5
Ratio XML short/binary: 3.5

32 bit int:
XML - Typical name: 5(<></>) + 2*12(tag) + 10(data) = 39
XML - Short name: 5(<></>) + 2 * 2(tag) + 10(data) = 19
Lumas - Text Long Name: 2(note 1) + 12(tag) + 10(data) = 24
Lumas - Text short name: 2(note 1) + 2(tag) + 10(data) = 14
Lumas - Text untagged: 1(note 2) + 10(data) = 11
Lumas binary: 2(tag+len) + 4(data) = 6
Ratio XML Long/Lumas long: 1.6
Ratio XML long/Lumas short: 2.8
Ratio XML short/Lumas short: 1.4
Ratio XML long/binary: 6.5
Ratio XML short/binary: 3.2

string: http://example.com/conference/on (length = 32)
XML - Typical name: 5(<></>) + 2*12(tag) + 32(data) = 61
XML - Short name: 5(<></>) + 2 * 2(tag) + 32(data) = 41
Lumas - Text Long Name: 2(note 1) + 12(tag) + 2(quoting) + 32(data) = 48
Lumas - Text short name: 2(note 1) + 2(tag) + 2(quoting) + 32(data) = 38
Lumas - Text untagged: 1(note 2) + 2(quoting) + 32(data) = 35
Lumas binary: 2(tag+len) + 32(data) = 34
Ratio XML Long/Lumas long: 1.27
Ratio XML long/Lumas short: 1.6
Ratio XML short/Lumas short: 1.08
Ratio XML long/binary: 1.79
Ratio XML short/binary: 1.2

string: Sally (length = 5)
XML - Typical name: 5(<></>) + 2*12(tag) + 5(data) = 34
XML - Short name: 5(<></>) + 2 * 2(tag) + 5(data) = 14
Lumas - Text Long Name: 2(note 1) + 12(tag) + 2(quoting) + 5(data) = 21
Lumas - Text short name: 2(note 1) + 2(tag) + 2(quoting) + 5(data) = 11
Lumas - Text untagged: 1(note 2) + 2(quoting) + 5(data) = 8
Lumas binary: 2(tag+len) + 5(data) = 7
Ratio XML Long/Lumas long: 1.6
Ratio XML long/Lumas short: 3.1
Ratio XML short/Lumas short: 1.27
Ratio XML long/binary: 4.9
Ratio XML short/binary: 2

note 1: Lumas needs an = sign and a minimum of 1 white space char to separate values.

note 2: Lumas needs a minimum of 1 white space char to separate values.

note 3: The Lumas binary encoding encodes a bool into 2 bytes using a special length code.
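
The per-field sums above are simple enough to recompute mechanically. The following Python sketch does so from the same counts (5 characters of XML markup, tag lengths of 12 and 2, the 2 characters of note 1, 2 quoting characters for strings, and 2 bytes of binary tag+len). The function and variable names are mine, chosen for the example.

```python
XML_MARKUP = 5   # the five characters in <></>
LONG_TAG = 12    # typical tag length used in the figures
SHORT_TAG = 2    # abbreviated tag length used in the figures


def xml_size(tag_len, data_len):
    # The tag name appears in both the opening and the closing tag.
    return XML_MARKUP + 2 * tag_len + data_len


def lumas_text_size(tag_len, data_len, quoting=0):
    # note 1: '=' plus one white space char; strings add 2 quote chars.
    return 2 + tag_len + quoting + data_len


def lumas_binary_size(data_len):
    # 2 bytes of tag+len, then the raw data bytes.
    return 2 + data_len


def ratios(text_len, quoting, binary_len):
    xml_long = xml_size(LONG_TAG, text_len)
    xml_short = xml_size(SHORT_TAG, text_len)
    text_long = lumas_text_size(LONG_TAG, text_len, quoting)
    text_short = lumas_text_size(SHORT_TAG, text_len, quoting)
    binary = lumas_binary_size(binary_len)
    return {
        "XML long/Lumas long": xml_long / text_long,
        "XML long/Lumas short": xml_long / text_short,
        "XML short/Lumas short": xml_short / text_short,
        "XML long/binary": xml_long / binary,
        "XML short/binary": xml_short / binary,
    }


# name -> (text data length, quoting chars, binary data length)
CASES = {
    "Bool": (1, 0, 0),
    "16 bit int": (5, 0, 2),
    "32 bit int": (10, 0, 4),
    "string (length = 32)": (32, 2, 32),
    "string (length = 5)": (5, 2, 5),
}

for name, args in CASES.items():
    print(name)
    for label, value in ratios(*args).items():
        print(f"  Ratio {label}: {value:.2f}")
```

Changing LONG_TAG or SHORT_TAG shows how sensitive the ratios are to the assumed tag lengths.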

Of course the overall message saving depends on how often each of these types comes up.

To me the analysis shows that a binary encoding is typically very much shorter than a typical XML encoding (with long tag names etc.).

BUT: most of this gain is already realised by using a Megaco-like text encoding. (This is captured by the XML long/Lumas short ratio.) As this is what Lumas offers, rather than inventing yet another message encoding method, I would say that the debate should be between an XML encoding and a Lumas text encoding rather than between an XML encoding and a binary encoding!

Pete.
--
=============================================
Pete Cordell
Tech-Know-Ware Ltd
                        for XML to C++ data binding visit
                        http://www.tech-know-ware.com/lmx
                        (or http://www.xml2cpp.com)
=============================================

----- Original Message ----- From: "Henning Schulzrinne" <hgs at cs.columbia.edu>
To: "XCON-IETF" <xcon at ietf.org>
Sent: Friday, March 17, 2006 4:19 AM
Subject: [XCON] On encodings



Since we're on the perennially fun topic of protocol encodings, maybe we can get past the stereotypes and try to do an engineering comparison, or at least an attempt at an approximation. Roughly speaking, an XML element is equivalent in functionality to a TLV object: it contains a label (element tag and the 'type' field), length or delimiter fields to figure out the boundaries, and the actual data (value). ASN.1 BER is close enough to TLV for this apples-to-pears comparison.

For TLV, you have a type value of 16 to 32 bits, depending on the predictions on the need to extend the protocol. For XML, you have <tag></tag>, thus, 4 bytes plus whatever length you make the tag from 'X' to 'WeLikeToPutTheWholeRFCIntoTheTagSoThatTheImplementerDoesntHaveToReadTheRFC'.

For length, this is already counted above for XML; for TLV, it would presumably be again 16 or 32 bits.

Thus, the basic overhead is fairly similar if you choose small element names. Since numeric labels aren't self-describing at all, this only seems like a fair comparison.

The big difference is in the datatypes. Clearly, if you have long binary opaque data, base64 encoding typical for XML costs you about 30%.
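
The base64 figure can be checked with a few lines of Python. This is a rough sketch using the standard library; the 300 byte payload is an arbitrary choice for the example. Base64 emits 4 characters for every 3 input bytes, so the cost is about a third, before any padding or line breaks.

```python
import base64

payload = bytes(300)                 # 300 bytes of arbitrary (here zero) data
encoded = base64.b64encode(payload)  # 4 output characters per 3 input bytes
overhead = len(encoded) / len(payload) - 1
print(len(payload), len(encoded), f"{overhead:.0%}")
```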

Integers and floats are trickier: Unless you believe in ASN.1-style variable-length encoding, it is quite possible that in many practical cases, XML integers will be shorter since TLV always has to allocate the maximum range, presumably 32 bits for integers and 32 (single precision) or 64 bits for floats. You can build custom fixed-point types in binary encodings, but they are a pain to get right (All kinds of funny things happen because of signed vs. unsigned issues, for example. Java programmers will hate you if you try this...)

Strings are obviously the same size in either TLV or XML, except for the use of delimiters (such as ") or length fields, but that's a wash at roughly 2 bytes each.

The one big difference for XML is the namespace declaration. The impact of that depends on the number of namespaces and the size of the overall document. This is also hard to compare since they give you proprietary extensions without the kludges like vendor IDs that binary protocols have to go through (unless they choose similar Java-style or domain-style tags). Indeed, for namespace-style extensions that aren't IANA registered, TLV is usually a pain. You end up with either per-TLV long labels or re-invent the XML-style indirection approach.

This is a rough comparison, but it indicates that for the same functionality, the cost isn't all that different. I'm guessing that for protocols that have a common application mix of text and mostly-integer numeric values, that we're talking about 10 to 30%.

You do pay a price for very long tags, but only if you can't use gzip-style compression, which essentially does the text-to-code translation automatically and without penalty.
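
The effect of compression on long tags can be tried out with a small Python sketch. It uses zlib from the standard library (the same DEFLATE algorithm that gzip uses) on a toy document; the tag names and the document shape are assumptions made for the example, so the exact numbers say little about real messages, and very short messages will compress far less well than this repetitive one.

```python
import zlib


def build_doc(tag: str, count: int = 200) -> bytes:
    """Build a toy document of `count` elements that all share one tag name."""
    body = "".join(f"<{tag}>user{i}</{tag}>" for i in range(count))
    return body.encode("ascii")


long_doc = build_doc("conference-description")  # hypothetical long tag
short_doc = build_doc("cd")                     # its abbreviated form

raw_ratio = len(long_doc) / len(short_doc)
compressed_ratio = len(zlib.compress(long_doc)) / len(zlib.compress(short_doc))

print(f"long/short size ratio, uncompressed: {raw_ratio:.2f}")
print(f"long/short size ratio, compressed:   {compressed_ratio:.2f}")
```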

I'm not comparing other textual, non-XML approaches here. Frankly, I don't think they have a chance in the market place and the size difference for the ones I have seen isn't all that large.

A separate issue is the functionality. This is left as an exercise for another post.

Henning

_______________________________________________
XCON mailing list
XCON at ietf.org
https://www1.ietf.org/mailman/listinfo/xcon

Attachment: get-xsd-names.pl
Description: Binary data

Note: Messages sent to this list are the opinions of the senders and do not imply endorsement by the IETF.