> RFC 2277 says: "This document uses the term "charset" to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry [REG]. (Note that this is NOT a term used by other standards bodies, such as ISO).".
> The point is that if the term is not used elsewhere, it may mean that the concept is adapted to the network dynamic environment and others are less or not. It charset=encoding_scheme+character_set.
The term 'charset' got to where it is because when the MIME spec was created, the people creating it were thinking in terms of 7-bit and 8-bit encodings, and in addition just (most probably unconsciously) shortened "coded character set" to "character set". So the term is there mostly by accident.
I'm sorry, but this simply isn't true. The term "charset" was carefully and intentionally chosen, as was the way the term is defined. The terms "coded character set" and "character encoding scheme" were just as intentionally rejected for use in MIME. And none of this has anything to do with 7bit versus 8bit encoding issues. I won't bother getting into the hows and whys of this choice since it is in no way relevant to the present discussion on this list, but I did want to point out that this position, although often stated, is incorrect.
...
> But the more I think of them, the more I have difficulty understanding what the "script" notion, introduced in the Draft, brings in addition to the charsets: it belongs to it.
There is some connection, in that many "charset"s only encode one script, or to be more precise, one script + basic ASCII + some symbols. But there are some important "charset"s (in particular UTF-8 and UTF-16,...) where this doesn't apply. Also, there are many other encodings that contain multiple scripts (e.g. you can write Greek with iso-2022-jp, and so on).
And since the trend is (hopefully) towards using all-inclusive charsets like utf-8, the ability to determine script from the charset label alone, which as you say never worked all that well, is going to disappear over time. OTOH, as things coalesce around Unicode (irrespective of encoding), the need to know lots of charsets in order to dig script information out of the actual content is going to decrease.
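To make the point above concrete, here is a rough sketch of why the charset label says little about script while the decoded content does. It is only an illustration: the function name rough_scripts is made up for this example, and taking the first word of each Unicode character name is a crude stand-in for the real Unicode Script property, which the Python standard library does not expose.

```python
import unicodedata
from collections import Counter


def rough_scripts(data: bytes, charset: str = "utf-8") -> Counter:
    """Count letters per approximate script in a byte string.

    Assumption: the first word of the Unicode character name
    ("LATIN", "GREEK", "CYRILLIC", ...) is used as a rough proxy
    for the script. This is not the Unicode Script property.
    """
    # The charset label only tells us how to turn octets into characters.
    text = data.decode(charset)
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


# The same label ("utf-8") carries mixed-script content ...
mixed = rough_scripts("abc αβ".encode("utf-8"))

# ... and a label often associated with Japanese can carry Greek.
greek_in_jp = rough_scripts("αβ".encode("iso2022_jp"), "iso2022_jp")
```

In both calls the script information comes out of the decoded characters, not out of the label passed as charset.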
> I therefore tend to think the "script" information is to be located in the charset tag.
The Web, email, and a lot of other things have worked extremely well without script information in charset tags, and I don't see why this would not continue.
Absolutely. Charsets were carefully defined to provide the information necessary to display or process a given object. They are very intentionally NOT defined to be labels describing the specific content of a particular document.
> I suppose they are able to understand UTF-8.latin as UTF-8 and that legacy is transparent?
Definitely not. For language tags, quite a few applications understand subtag-based prefixes, as the specs have been defined with subtags in mind from the start.
And those which do not at least ignore subtags are therefore broken. For charsets, they do not. Charsets do not have and never had subtags.
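A small sketch of the difference, for illustration only. It uses Python's codec registry as one example of software that resolves charset labels (it is not the IANA registry), and the helper names charset_known and language_prefix_matches are invented for this example. "UTF-8.latin" is the hypothetical label from the question above.

```python
import codecs


def charset_known(label: str) -> bool:
    """Return True if Python's codec registry resolves the label.

    The label is looked up as a whole; nothing in it is treated
    as a subtag that could be dropped.
    """
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def language_prefix_matches(tag: str, prefix: str) -> bool:
    """Simple prefix match on whole subtags of a language tag.

    "en" matches "en" and "en-US", but not "eng".
    """
    tag, prefix = tag.lower(), prefix.lower()
    return tag == prefix or tag.startswith(prefix + "-")


print(charset_known("utf-8"))                   # plain label resolves
print(charset_known("UTF-8.latin"))             # suffixed label does not
print(language_prefix_matches("en-US", "en"))   # subtag prefix matches
```

A language-tag consumer can fall back from "en-US" to "en" because the syntax was designed around subtags; a charset consumer has no comparable rule for falling back from "UTF-8.latin" to "UTF-8".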
Absolutely. And moreover, the rules for what constitutes a "charset" are intentionally pretty narrow, so as to prevent creep of stuff into charset-space that properly belongs elsewhere. (Sadly, there was a period during which the rules weren't being properly applied to the charset registration process, so there is some amount of cruft in the registry.)
Ned
_______________________________________________
Ltru mailing list
Ltru at lists.ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.