Re: [Ltru] RFC 2277 - considerations



> RFC 2277 says: "This document uses the term "charset" to mean a set of
> rules for mapping from a sequence of octets to a sequence of characters,
> such as the combination of a coded character set and a character encoding
> scheme; this is also what is used as an identifier in MIME "charset="
> parameters, and registered in the IANA charset registry [REG].  (Note that
> this is NOT a term used by other standards bodies, such as ISO).".

> The point is that if the term is not used elsewhere, it may mean that the
> concept is adapted to the dynamic network environment while others are less
> so, or not at all. That is, charset = encoding_scheme + character_set.

The term 'charset' got to where it is because when the MIME spec was created,
the people creating it were thinking in terms of 7-bit and 8-bit encodings,
and in addition just (most probably unconsciously) shortened "coded character
set" to "character set". So the term is there mostly by accident.

I'm sorry, but this simply isn't true. The term "charset" was carefully and
intentionally chosen, as was the way the term is defined. The terms "coded
character set" and "character encoding scheme" were just as intentionally
rejected for use in MIME. And none of this has anything to do with 7bit versus
8bit encoding issues.

I won't bother getting into the hows and whys of this choice since it is in no
way relevant to the present discussion on this list, but I did want to point
out that this position, although often stated, is incorrect.

...

> But the more I think of them, the more I have difficulty understanding
> what the "script" notion, introduced in the Draft, brings in addition to
> the charsets: it belongs to them.

There is some connection, in that many "charset"s only encode one script,
or to be more precise, one script + basic ASCII + some symbols.
But there are some important "charset"s (in particular UTF-8 and
UTF-16,...) where this doesn't apply. Also, there are many other
encodings that contain multiple scripts (e.g. you can write
Greek with iso-2022-jp, and so on).
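
The iso-2022-jp case is easy to check. Below is a minimal sketch in Python,
assuming the standard library's "iso2022_jp" codec and using the first word of
each Unicode character name as a rough stand-in for its script. The helper name
rough_scripts is made up for illustration, and the heuristic is not a real
script property lookup.

```python
import unicodedata


def rough_scripts(text):
    """Approximate the scripts present in text.

    Assumption: the first word of a character's Unicode name
    (LATIN, GREEK, HIRAGANA, ...) is a good enough proxy here.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if not ch.isspace()}


# Latin, Greek and Japanese kana in a single string.
sample = "abc αβγ かな"

# One charset label carries all of it; the label alone says nothing
# about which scripts the content actually uses.
encoded = sample.encode("iso2022_jp")
round_trip_ok = encoded.decode("iso2022_jp") == sample

print(round_trip_ok)
print(sorted(rough_scripts(sample)))
```

The script information comes from inspecting the decoded characters, not from
the charset label.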

And since the trend is (hopefully) towards using all-inclusive charsets like
utf-8, the ability to determine script from the charset label alone, which as
you say never worked all that well, is going to disappear over time. OTOH, as
things coalesce around Unicode (irrespective of encoding), the need to know
lots of charsets in order to dig script information out of the actual content
is going to decrease.

> I therefore tend to think the "script" information is to be located in
> the charset tag.

The Web, email, and a lot of other things have worked extremely well without
script information in charset tags, and I don't see why this would not
continue.

Absolutely. Charsets were carefully defined to provide the information
necessary to display or process a given object. They are very intentionally NOT
defined to be labels describing the specific content of a particular document.

> I suppose they are able to understand UTF-8.latin as UTF-8 and that
> legacy is transparent?

Definitely not. For language tags, quite a few applications understand
subtag-based prefixes, as the specs have been defined with subtags in mind
from the start.

And those which do not at least ignore subtags are therefore broken.

For charsets, they do not. Charsets do not have and
never had subtags.
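
To make the contrast concrete, here is a small sketch in Python. The function
names tag_matches and charset_known are hypothetical, and Python's codec
registry is only a stand-in for how an application might resolve a charset
label; real applications will differ in detail.

```python
import codecs


def tag_matches(language_range, tag):
    """Prefix match on whole subtags, ignoring case.

    'sr' matches 'sr' and 'sr-Latn', but 'en' does not match 'eng'.
    """
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def charset_known(label):
    """Look the label up as a whole name; there is no subtag fallback."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


print(tag_matches("sr", "sr-Latn"))
print(tag_matches("en", "eng"))
print(charset_known("utf-8"))
print(charset_known("utf-8.latin"))
```

A language tag with an extra subtag still matches its prefix, while a charset
label with something appended is simply an unknown name.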

Absolutely. And moreover, the rules for what constitutes a "charset" are
intentionally pretty narrow, so as to prevent creep of stuff into charset-space
that properly belongs elsewhere. (Sadly, there was a period during which the
rules weren't being properly applied to the charset registration process, so
there is some amount of cruft in the registry.)

				Ned

_______________________________________________
Ltru mailing list
Ltru at lists.ietf.org
https://www1.ietf.org/mailman/listinfo/ltru



