At 10:58 05/05/08, JFC (Jefsey) Morfin wrote:
>This is not a position. Just a thinking and a call for comments.
>
>----
>
>RFC 2277 says: "This document uses the term "charset" to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry [REG]. (Note that this is NOT a term used by other standards bodies, such as ISO).".
>
>The point is that if the term is not used elsewhere, it may mean that the concept is adapted to the network dynamic environment and others are less or not. It charset=encoding_scheme+character_set.
The term 'charset' got to where it is because when the MIME spec was created, the people creating it were thinking in terms of 7-bit and 8-bit encodings, and in addition (most probably unconsciously) shortened "coded character set" to "character set". So the term is there mostly by accident.

>I first considered that W3C was right in having identified its need of "scripts" support, however the idea of making them dependent from languages seemed to be strange.
It wasn't W3C that identified the need for script support. Some people that you associate with W3C may have been involved, but that doesn't mean that it was W3C. The need for scripts in language tags (or separately) is an old issue. I remember a discussion with Michael Everson, who proposed an HTTP header Accept-Script (or some such) years ago on the ietf-languages list. The discussion at that time came to the conclusion that having script orthogonal to language might be a nice idea in principle, but that in practice the connection is extremely strong, and in most cases script info isn't necessary because it can easily be implied.

>But the more I think of them, the more I have difficulty understanding what the "script" notion, introduced in the Draft, brings in addition to the charsets: it belongs to it.
There is some connection, in that many "charset"s only encode one script, or to be more precise, one script + basic ASCII + some symbols. But there are some important "charset"s (in particular UTF-8 and UTF-16, ...) where this doesn't apply. Also, there are many other encodings that contain multiple scripts (e.g. you can write Greek with iso-2022-jp, and so on).

>The more I see sources of conflicts if this is not respected.

If you see such conflicts, could you give an actual example?

>The more I see that the script is one of the rules which shares in the definition of the character set, and the more I fail to see where the W3C has a problem (except may be in confusing charset with encoding scheme only, however http://www.w3.org/TR/REC-html40/charset.html starts with a clear "character set" part where it specifically quote Latin and Cyrillic)).
>
>I come back to a normal process of access to a page/document.
>
>1. to be able to read it I need to know the charset. This is the first information. It tells me the rules for mapping from a sequence of bytes to a sequence of characters (character encoding scheme: ex. UTF-8
Yes indeed. Having the correct 'charset' is extremely important.

>and combination of coded characters, ex: ISO 15924). Ex: UTF-8-Latin

No. If you know it's UTF-8, you can just look inside the document and check what script(s!) are used.

>2. then when I read I need to understand. I have the language. And possible region. As per RFC 3066 existing scheme and not calling for a modification of the existing libraries.
>
>3. the interest is that this is compatible with IDN tables (and permits to address the high level IDN homograph problem, since charsets are documented everywhere). I also note that RFC 2277 and 3066 seem to address the locales need (however CLDR may have some proprietary special needs, authors have not documented?)
>
>I therefore tend to think the "script" information is to be located in the charset tag.
The Web, email, and a lot of other things have worked extremely well without script information in charset tags, and I don't see why this would not continue.

>I suppose they are able to understand UTF-8.latin as UTF-8 and that legacy is transparent?
Definitely not. For language tags, quite a few applications understand subtag-based prefixes, as the specs have been defined with subtags in mind from the start. For charsets, they do not. Charsets do not have and never had subtags.

Regards, Martin.
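
P.S. Two of the points above can be sketched in code: that the scripts of a UTF-8 document can be read off its content, and that language tags support prefix matching on subtags while charset names do not. This is only a minimal Python sketch under stated assumptions, not how any particular application does it. The names scripts_in_text, lang_matches, charset_known and KNOWN_CHARSETS are made up for this example. The script guess uses the first word of each letter's Unicode character name, which is a rough stand-in for the real Unicode Script property, and KNOWN_CHARSETS is a tiny illustrative list, not the IANA registry.

```python
import unicodedata

# Illustrative stand-in for the IANA charset registry (assumption: tiny subset).
KNOWN_CHARSETS = {"utf-8", "utf-16", "iso-8859-1", "iso-2022-jp"}


def scripts_in_text(data, charset="utf-8"):
    """Decode bytes with the given charset and guess which scripts occur.

    Heuristic only: takes the first word of each letter's Unicode name
    (e.g. 'LATIN', 'GREEK') instead of the real Script property.
    """
    text = data.decode(charset)
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def lang_matches(language_range, tag):
    """Prefix match on whole subtags, case-insensitive.

    'sr' matches 'sr' and 'sr-Latn', but not 'srx'.
    """
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


def charset_known(name):
    """Charset names are opaque labels: exact (case-insensitive) match only."""
    return name.lower() in KNOWN_CHARSETS
```

With these definitions, the UTF-8 bytes of "abc αβγ" yield LATIN and GREEK without any script label on the charset, and lang_matches("sr", "sr-Latn") is true (the tag is just an illustration of a script subtag). A name like "UTF-8.latin" is simply an unknown charset, because nothing strips a suffix from it.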
_______________________________________________
Ltru mailing list
Ltru at lists.ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.