Following is a long screed, probably of no value, set down in stream of consciousness while waiting to go off to view a baseball game this morning. It is still here this evening and I find no desire to revise it, but think it might be useful, at least as raw meat for people to seize at.

I find it humorous that Mark and I use different examples to say similar (but not identical) things.

FWIW,

Addison

---

If the script subtag is necessary, then it must go somewhere. If we turn everything into a variant, we end up with variant soup. RFC 3066bis requires the subtags to be in a specific order, keeping us well away from the fatal argument about what order the subtags go in if there is no inherent order.

Some languages are (customarily) written and some are not. For written forms of a given language, script and orthographic variation are important in tagging because they make a difference in the kinds of processing one does on that text--spell checking and rendering and so forth.

It is widely recognized that spoken language has a much wider variation than its written counterpart. In some ways, what one wants or needs to tag in spoken language is different from what one needs to identify in written text.

For example, in written language the distinction between en-US and en-GB is important. There are spelling and grammatical differences (tyre/tire, jail/gaol, color/colour, etc.). There are also word-choice differences (pavement/sidewalk, boot/trunk, etc.).

The differences in spoken English, however, are quite diverse and different from those in the written forms. The question is: when I have a recording of some dialect (Geordie, for example), what is *necessary* to identify? It will be quite different from what varies in written language, and script is usually not interesting in this case. In that regard, "en-GB" might not be the most useful identifier any more (spoken English in the U.K. varies rather more widely than in the USA, although that is a VERY subjective claim).

That doesn't mean that script is less important than region or variant for written language, though. It means that script is not important to an application of spoken text. What is important for text remains the same.

Tag choice is important: giving users the tools they need in different circumstances is vital. For text, this is frequently enough the script subtag. For spoken text other things may be important (although, again, I think that a recording of speech, for example, generally doesn't need tagging that is as specific, since spoken text is naturally more expressive than written text--some things don't need to be identified if they are mutually understood).

The ontology of spoken (or signed or otherwise signaled) language is different from that of written text. This is not a bad thing; it is just how things are.

RFC 3066 is based fundamentally on the "Atomic Theory of Languages", which says that a language can be defined and that "en-US" represents, somehow, a distinct class of documents (a mixture of written, spoken, signed, or signaled). It is based on the idea that one can register a tag for a language that is separate from all other languages (there is none of this subtag stuff in RFC 3066--"en" is a different language from "en-US").

If you accept RFC 3066, then you accept the premise that the tag for written text and the tag for spoken text are the same. If you accept RFC 3066bis, then you accept that this MAY be true, but also that your application may have varying needs in what is identified. The idea that we must never interpolate one or another subtag into its logical place in the tag hierarchy because a different application doesn't need that subtag is demonstrably false.

Finally, I think it is also important to recognize that language tags are an abstraction of language.

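To make the point about fixed subtag order and differing application needs a little more concrete, here is a minimal Python sketch. It is an illustration only, not anything taken from the drafts: the function names `split_tag` and `view_for` are hypothetical, the subtag shapes are simplified assumptions (four letters for script, two letters or three digits for region, everything after that treated as variants), and extended language subtags, extensions, private use, and grandfathered tags are ignored entirely.

```python
import re

# Simplified subtag shapes, assumed for illustration only.
SCRIPT = re.compile(r"^[A-Za-z]{4}$")
REGION = re.compile(r"^(?:[A-Za-z]{2}|[0-9]{3})$")


def split_tag(tag):
    """Split a tag into language, script, region and variants by position.

    Because the order is fixed, each subtag can be recognized from its
    position and shape alone, with no registry lookup.
    """
    parts = tag.split("-")
    fields = {"language": parts[0].lower(), "script": None,
              "region": None, "variants": []}
    rest = parts[1:]
    if rest and SCRIPT.match(rest[0]):
        fields["script"] = rest.pop(0).title()
    if rest and REGION.match(rest[0]):
        fields["region"] = rest.pop(0).upper()
    # Whatever remains is treated as variants (a simplification).
    fields["variants"] = [v.lower() for v in rest]
    return fields


def view_for(tag, wanted):
    """Rebuild a tag keeping only the subtags an application cares about."""
    f = split_tag(tag)
    out = [f["language"]]
    for name in ("script", "region"):
        if name in wanted and f[name]:
            out.append(f[name])
    if "variants" in wanted:
        out.extend(f["variants"])
    return "-".join(out)


# A text application might keep script and region; a (hypothetical)
# speech application might ignore script without disturbing the rest.
text_view = view_for("en-Latn-GB", {"script", "region"})
speech_view = view_for("en-Latn-GB", {"region"})
```

The only point of the sketch is that one application dropping the script subtag does not disturb the rest of the tag, which is what a fixed order buys. Real matching would need to follow the matching draft rather than this shortcut.
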
We cannot hope to encode the wide variation of style and substance in text (or speech) in a simple tag. We get close enough. Examples of literature such as Twain's Pudd'nhead Wilson--a book written almost wholly in an exaggerated regional (and, perhaps only later, stereotypical) dialect--show how the expressiveness of spoken text is difficult to "encode" in any kind of text (Twain manages it somehow). Tagging that document as "en-US" somehow fails to capture the substance of the text (my spell checker complains about nearly every word the grammar checker does not). But it is sufficient, in many ways, to use such a tag in that case: there is no more American a writer than Twain, and his work is part of the world of "en-US" as certainly as Hemingway's famously simple (or my infamously convoluted) sentences are.

In short, I find this hand wringing somewhat pointless. I think Ira and Tex's points, while well-taken in isolation, break down when considered in the broader context of different applications. Fundamentally, this is what the matching draft is about. The idea of a subtag registry is a different matter and I think we have the design right.

Regards,

Addison

Addison P. Phillips
Globalization Architect, Quest Software
Chair, W3C Internationalization Core Working Group

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: ltru-bounces at lists.ietf.org [mailto:ltru-bounces at lists.ietf.org] On
> Behalf Of McDonald, Ira
> Sent: 2005-06-12 10:14
> To: ltru at ietf.org
> Subject: [Ltru] FW: tags for written or spoken content, was Swiss
> german,spoken
>
> Hi,
>
> Many people may have seen this note from Tex Texin to the
> IETF Languages list, but I cross-post it here, because it
> points out a fundamental IETF last call vulnerability for
> the Language Subtag Registry, to wit, that the _entire_
> justification for shoehorning 'script' between 'language'
> and 'region' is the overwhelming focus on traditional
> uses of language tags for _written_ content.
>
> The entire (unloved) "Suppress-Script" stuff attempts to
> minimize the damage in the real world (now and in the
> indefinite future) of the 'script' interloper.
>
> A much cleaner solution remains to push 'script' back where
> it belongs in medium-neutral language tags --> after the
> 'region' code.
>
> And I think that bending 'script' to include spoken (as
> Randy speculated somewhere recently?) is just awful.
>
> Cheers,
> - Ira
>
> Ira McDonald (Musician / Software Architect)
> Blue Roof Music / High North Inc
> PO Box 221 Grand Marais, MI 49839
> phone: +1-906-494-2434
> email: imcdonald at sharplabs.com
>
> -----Original Message-----
> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no]On Behalf Of Tex Texin
> Sent: Saturday, June 11, 2005 8:44 PM
> To: Michael Everson
> Cc: ietf-languages at iana.org
> Subject: tags for written or spoken content, was Swiss german, spoken
>
> Michael Everson wrote:
>
> > The tags with which we are concerned assume some sort of orthography.
>
> Well, the tendency has been to associate tags with writing since so much
> of computer content is written, and so it is mostly written material
> that software developers tag.
>
> However, rfc 3066 says:
>
> "The language tag always defines a language as spoken (or written,
> signed or otherwise signaled) by human beings for communication of
> information to other human beings."
>
> so orthography is not a requirement and in fact, this tendency to
> (incorrectly I believe) presume writing, is one of the reasons I
> objected to "script" being placed between language and region in
> 3066bis. As we tag more multimedia, the emphasis on writing is falsely
> placed. Unless we want to have a separate tagging system for audio...
>
> tex
> _______________________________________________
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
> _______________________________________________
> Ltru mailing list
> Ltru at lists.ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.