Ian Hickson wrote:
> You are proposing that given a length and a series of bytes encoding text
> in a variable-sized encoding (UTF-8), the application return a series of
> characters. My point is that this means that you have two lengths (the
> length of the string in characters and the length of the string in bytes),
> so you risk inexperienced software authors making elementary yet dangerous
> mistakes in terms of how to read (or write) data to the stream. WebSocket
> tries to avoid ever mixing the two (you either deal with bytes and byte
> lengths, or you use sentinel bytes and no lengths -- you never have
> characters and byte lengths mixed together).
Inexperienced authors, especially those writing 100 lines of Perl,
will send ISO-8859-1 or other text which occasionally contains 0xff
bytes in the middle.
Even experienced authors will make that mistake sometimes. What do you
think will happen when someone does something like this:
- Read list of filenames in a directory. They're UTF-8 (assumed),
or the author is unfamiliar with character encodings and everything
works fine in their ASCII development environment.
- Concatenate the list with newlines, as people do.
- Send the result as a frame.
Or this:
- Read lines from a text file, which is supposed to be UTF-8 encoded.
- Send each line as a frame.
- (Oops, one of the text files you gave me had an 0xff byte in it.)
Result: Because of assumptions, 0xff bytes will be sent occasionally
in the middle of a frame. Everything afterwards will break, but it'll
be rare enough that the author doesn't notice. That is the same reason
you gave for why authors get lengths wrong.
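To make the failure concrete, here is a minimal sketch in Python. It
assumes a simplified framing (a 0x00 start byte, the payload, then a 0xff
end byte) and a naive receiver; the names encode_frame and decode_frames
are made up for illustration and are not taken from any real library.

```python
FRAME_START = b"\x00"
FRAME_END = b"\xff"


def encode_frame(payload: bytes) -> bytes:
    """Wrap the payload in sentinel bytes without checking that it is
    valid UTF-8 (this is the author's mistake in the scenarios above)."""
    return FRAME_START + payload + FRAME_END


def decode_frames(stream: bytes) -> list:
    """Naive receiver: every frame runs from a 0x00 byte up to the next
    0xff byte. Raises ValueError once the stream is out of sync."""
    frames = []
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + 1] != FRAME_START:
            raise ValueError(f"expected frame start at offset {pos}")
        end = stream.index(FRAME_END, pos + 1)  # first 0xff ends the frame
        frames.append(stream[pos + 1:end])
        pos = end + 1
    return frames


# A directory listing joined with newlines. The second filename is
# ISO-8859-1 and contains U+00FF, which is the single byte 0xff there.
listing = b"a.txt\n" + "b\u00ff.txt".encode("latin-1") + b"\nc.txt"

try:
    decode_frames(encode_frame(listing))
except ValueError as err:
    # The receiver cut the frame short at the stray 0xff and then found
    # payload bytes where it expected the next frame to start.
    print("stream out of sync:", err)
```

With an all-ASCII listing the same code round-trips cleanly, which is
why the bug survives testing in an ASCII development environment.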
The sentinel approach does not solve this fragility problem; it merely
shifts it to a different place.
-- Jamie