As mentioned in the other post, this is an interesting analysis, so I
thought I'd try to add a bit more detail to it.
I've compared XML encodings (with long and short tag names) against a
Megaco-like text encoding (in this case Lumas*) and a binary encoding
(Lumas again).
*To justify the choice of Lumas encoding here, the original Lumas
was an attempt to formalise the Megaco text encoding so that it
could work for both text and binary. So Lumas has a lot in common
with Megaco encoding. Indeed, change "context = 123 { ..." to
"context = { 123 ..." and you're there!
Note that although Megaco is a text encoding (one of them anyway!),
in terms of message length it probably has more in common with a
binary encoding than with an XML encoding. This is because in a
number of cases Megaco has no tag, or only a short tag.
Tagging is obviously a big issue. To get an idea of the length of
typical XML tags I grepped (well Perled really) out the element and
attribute names defined in the CPCP draft. When doing this I got
an average tag length of 12.7 characters. I noted that many of the
names were of the form "a-long-tag-name", so I derived short tag
names by taking the first letter of each sub-word. The average of
this was 2.08 characters. (I've attached the Perl code used so
that you can replicate my analysis if desired.)
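
The abbreviation step described above can be sketched as follows. This is a Python illustration, not the attached Perl script; the sample names are hypothetical (they are not taken from the CPCP draft), and only hyphen-separated names are handled.

```python
def abbreviate(name: str) -> str:
    """Build a short tag from the first letter of each hyphen-separated sub-word."""
    return "".join(part[0] for part in name.split("-") if part)


def average_length(names: list[str]) -> float:
    """Mean length, in characters, of the given names."""
    return sum(len(n) for n in names) / len(names)


# Hypothetical element names, NOT taken from the CPCP draft.
names = ["a-long-tag-name", "conference-uri", "max-participants"]
short_names = [abbreviate(n) for n in names]

print(short_names)
print(average_length(names), average_length(short_names))
```

Running the same two functions over the real element and attribute names would give the long and short averages that the size calculations rely on.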
As noted in the other mail, XSD does not lend itself to designing
XML with small tags. If everything is called X (or similar) the
XSD doesn't describe the message very well. So, although you can
define an XSD with small tags, I think in practice you wouldn't!
(Aside: This is why Lumas allows for specifying a long descriptive
name in the message definition, and specifying a short name for
tagging.)
For the binary encoding I used Lumas's binary rules. These use
variable-length tag and length fields, much as UTF-8 uses a variable
number of bytes per character. As UTF-8 is already commonly part of
XML, I can't see that this is an issue.
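
To illustrate the UTF-8 side of that analogy only, the short Python sketch below shows how the encoded size grows with the code point. It does not show the Lumas binary field layout, which is not described here; the sample characters are arbitrary choices.

```python
# Code points chosen to land in each of UTF-8's four length classes.
samples = {
    "U+0041": "A",
    "U+00E9": "\u00e9",
    "U+20AC": "\u20ac",
    "U+1F600": "\U0001F600",
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, size in utf8_lengths.items():
    print(f"{label}: {size} byte(s)")
```

Small values stay at one byte and only larger values pay for extra bytes, which is the property being relied on for the tag and length fields.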
So here are the results I came up with:
Bool:
XML - Typical name: 5(<></>) + 2*12(tag) + 1(data) = 30
XML - Short name: 5(<></>) + 2 * 2(tag) + 1(data) = 10
Lumas - Text Long Name: 2(note 1) + 12(tag) + 1(data) = 15
Lumas - Text short name: 2(note 1) + 2(tag) + 1(data) = 5
Lumas - Text untagged: 1(note 2) + 1(data) = 2
Lumas binary: 2(tag+len) + 0(data note 3) = 2
Ratio XML Long/Lumas long: 2
Ratio XML long/Lumas short: 6
Ratio XML short/Lumas short: 2
Ratio XML long/binary: 15
Ratio XML short/binary: 5
16 bit int:
XML - Typical name: 5(<></>) + 2*12(tag) + 5(data) = 34
XML - Short name: 5(<></>) + 2 * 2(tag) + 5(data) = 14
Lumas - Text Long Name: 2(note 1) + 12(tag) + 5(data) = 19
Lumas - Text short name: 2(note 1) + 2(tag) + 5(data) = 9
Lumas - Text untagged: 1(note 2) + 5(data) = 6
Lumas binary: 2(tag+len) + 2(data) = 4
Ratio XML Long/Lumas long: 1.79
Ratio XML long/Lumas short: 3.8
Ratio XML short/Lumas short: 1.6
Ratio XML long/binary: 8.5
Ratio XML short/binary: 3.5
32 bit int:
XML - Typical name: 5(<></>) + 2*12(tag) + 10(data) = 39
XML - Short name: 5(<></>) + 2 * 2(tag) + 10(data) = 19
Lumas - Text Long Name: 2(note 1) + 12(tag) + 10(data) = 24
Lumas - Text short name: 2(note 1) + 2(tag) + 10(data) = 14
Lumas - Text untagged: 1(note 2) + 10(data) = 11
Lumas binary: 2(tag+len) + 4(data) = 6
Ratio XML Long/Lumas long: 1.6
Ratio XML long/Lumas short: 2.8
Ratio XML short/Lumas short: 1.4
Ratio XML long/binary: 6.5
Ratio XML short/binary: 3.2
string: http://example.com/conference/on (length = 32)
XML - Typical name: 5(<></>) + 2*12(tag) + 32(data) = 61
XML - Short name: 5(<></>) + 2 * 2(tag) + 32(data) = 41
Lumas - Text Long Name: 2(note 1) + 12(tag) + 2(quoting) + 32(data)
= 48
Lumas - Text short name: 2(note 1) + 2(tag) + 2(quoting) + 32(data)
= 38
Lumas - Text untagged: 1(note 2) + 2(quoting) + 32(data) = 35
Lumas binary: 2(tag+len) + 32(data) = 34
Ratio XML Long/Lumas long: 1.27
Ratio XML long/Lumas short: 1.6
Ratio XML short/Lumas short: 1.08
Ratio XML long/binary: 1.79
Ratio XML short/binary: 1.2
string: Sally (length = 5)
XML - Typical name: 5(<></>) + 2*12(tag) + 5(data) = 34
XML - Short name: 5(<></>) + 2 * 2(tag) + 5(data) = 14
Lumas - Text Long Name: 2(note 1) + 12(tag) + 2(quoting) + 5(data)
= 21
Lumas - Text short name: 2(note 1) + 2(tag) + 2(quoting) + 5(data)
= 11
Lumas - Text untagged: 1(note 2) + 2(quoting) + 5(data) = 8
Lumas binary: 2(tag+len) + 5(data) = 7
Ratio XML Long/Lumas long: 1.6
Ratio XML long/Lumas short: 3.1
Ratio XML short/Lumas short: 1.27
Ratio XML long/binary: 4.9
Ratio XML short/binary: 2
note 1: Lumas needs an = sign and a minimum of 1 white space char
to separate values.
note 2: Lumas needs a minimum of 1 white space char to separate
values.
note 3: The Lumas binary encoding encodes a bool into 2 bytes using
a special length code.
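
The size formulas used in the tables above can be written out as a small Python sketch, which makes it easy to try other tag or data lengths. The function and constant names are made up for this illustration; the overhead values are the ones stated in the tables and notes.

```python
XML_MARKUP = 5      # the five markup characters < > < / >
LUMAS_TAGGED = 2    # "=" plus one white space separator (note 1)
LUMAS_UNTAGGED = 1  # one white space separator (note 2)
BINARY_HEADER = 2   # tag + length fields
QUOTES = 2          # pair of quotes around a text string


def xml_size(tag_len: int, data_len: int) -> int:
    """Element with open and close tags: the tag name appears twice."""
    return XML_MARKUP + 2 * tag_len + data_len


def lumas_text_size(tag_len: int, data_len: int, quoted: bool = False) -> int:
    """Tagged Lumas text value; strings also pay for quoting."""
    return LUMAS_TAGGED + tag_len + (QUOTES if quoted else 0) + data_len


def lumas_untagged_size(data_len: int, quoted: bool = False) -> int:
    """Untagged Lumas text value."""
    return LUMAS_UNTAGGED + (QUOTES if quoted else 0) + data_len


def lumas_binary_size(data_len: int) -> int:
    """Binary value: 2 bytes of tag + length, then the data."""
    return BINARY_HEADER + data_len


LONG_TAG, SHORT_TAG = 12, 2

# 16 bit int: up to 5 decimal digits as text, 2 bytes in binary.
xml_long = xml_size(LONG_TAG, 5)
lumas_short = lumas_text_size(SHORT_TAG, 5)
binary = lumas_binary_size(2)

print(xml_long, lumas_short, binary)
print(round(xml_long / lumas_short, 2), round(xml_long / binary, 2))
```

Changing LONG_TAG to another average tag length shows how strongly the XML figures depend on the naming style.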
Of course the overall message saving depends on how often each of
these types comes up.
To me the analysis shows that a binary encoding is typically very
much shorter than a typical XML encoding (with long tag names etc.).
BUT: Most of this gain is already realised by using a Megaco-like text
encoding. (This is captured by the XML long/Lumas short ratio.) As
this is what Lumas offers, rather than inventing yet another
message encoding method, I would say that the debate should be
between an XML encoding and a Lumas text encoding rather than
between an XML encoding and a binary encoding!
Pete.
--
=============================================
Pete Cordell
Tech-Know-Ware Ltd
for XML to C++ data binding visit
http://www.tech-know-ware.com/lmx
(or http://www.xml2cpp.com)
=============================================
----- Original Message -----
From: "Henning Schulzrinne" <hgs at cs.columbia.edu>
To: "XCON-IETF" <xcon at ietf.org>
Sent: Friday, March 17, 2006 4:19 AM
Subject: [XCON] On encodings
Since we're on the perennially fun topic of protocol encodings,
maybe we can get past the stereotypes and try to do an
engineering comparison, or at least an attempt at an
approximation. Roughly speaking, an XML element is equivalent in
functionality to a TLV object: it contains a label (element tag
and the 'type' field), length or delimiter fields to figure out
the boundaries, and the actual data (value). ASN.1 BER is close
enough to TLV for this apples-to-pears comparison.
For TLV, you have a type value of 16 to 32 bits, depending on the
predictions on the need to extend the protocol. For XML, you have
<tag </tag>, thus, 4 bytes plus whatever length you make the tag
from 'X' to
'WeLikeToPutTheWholeRFCIntoTheTagSoThatTheImplementerDoesntHaveToReadTheRFC'.
For length, this is already counted above for XML; for TLV, it
would presumably be again 16 or 32 bits.
Thus, the basic overhead is fairly similar if you choose small
element names. Since numeric labels aren't self-describing at
all, this only seems like a fair comparison.
The big difference is in the datatypes. Clearly, if you have long
binary opaque data, base64 encoding typical for XML costs you
about 30%.
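
The base64 cost can be checked with a few lines of Python. This is only a sketch: the payload is arbitrary, and line wrapping or padding of very short inputs would add a little more. Base64 emits 4 characters for every 3 input bytes, so the growth is one third, in line with the rough 30% figure.

```python
import base64

payload = bytes(range(256)) * 3        # 768 bytes of arbitrary binary data
encoded = base64.b64encode(payload)    # 4 output characters per 3 input bytes

# Fractional growth of the encoded form over the raw bytes.
overhead = len(encoded) / len(payload) - 1
print(len(payload), len(encoded), f"{overhead:.0%}")
```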
Integers and floats are trickier: Unless you believe in ASN.1-style
variable-length encoding, it is quite possible that in many
practical cases, XML integers will be shorter since TLV always has
to allocate the maximum range, presumably 32 bits for integers
and 32 (single precision) or 64 bits for floats. You can build
custom fixed-point types in binary encodings, but they are a pain
to get right (All kinds of funny things happen because of signed
vs. unsigned issues, for example. Java programmers will hate you
if you try this...)
Strings are obviously the same size in either TLV or XML, except
for the use of delimiters (such as ") or length fields, but
that's a wash at roughly 2 bytes each.
The one big difference for XML is the namespace declaration. The
impact of that depends on the number of namespaces and the size
of the overall document. This is also hard to compare since they
give you proprietary extensions without the kludges like vendor
IDs that binary protocols have to go through (unless they choose
similar Java-style or domain-style tags). Indeed, for
namespace-style extensions that aren't IANA registered, TLV is usually a
pain. You end up with either per-TLV long labels or re-invent the
XML-style indirection approach.
This is a rough comparison, but it indicates that for the same
functionality, the cost isn't all that different. I'm guessing
that for protocols that have a common application mix of text and
mostly-integer numeric values, that we're talking about 10 to 30%.
You do pay a price for very long tags, but only if you can't use
gzip-style compression, which essentially does the text-to-code
translation automatically and without penalty.
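
A rough way to see the compression effect is to build the same document with a long and a short tag and compare sizes before and after DEFLATE (the algorithm gzip uses, reached here through Python's zlib module). The tag names and document shape below are hypothetical, and a single repeated tag is close to a best case for the compressor, so this is only a sketch.

```python
import zlib


def build_doc(tag: str, count: int = 200) -> bytes:
    """Repeat one element `count` times with a small integer payload."""
    return "".join(f"<{tag}>{i}</{tag}>\n" for i in range(count)).encode("ascii")


# Hypothetical tag names used only for illustration.
long_doc = build_doc("a-very-long-element-name")
short_doc = build_doc("avlen")

# Size ratio of long-tag to short-tag documents, raw and compressed.
raw_ratio = len(long_doc) / len(short_doc)
compressed_ratio = len(zlib.compress(long_doc)) / len(zlib.compress(short_doc))

print(f"raw long/short:        {raw_ratio:.2f}")
print(f"compressed long/short: {compressed_ratio:.2f}")
```

For very small messages the gain shrinks, since the compressor has little repetition to work with.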
I'm not comparing other textual, non-XML approaches here. Frankly,
I don't think they have a chance in the market place and the size
difference for the ones I have seen isn't all that large.
A separate issue is the functionality. This is left as an
exercise for another post.
Henning
<get-xsd-names.pl>