Mark,

I think my solution would be the most elegant way of dealing with this. Doug Ewell is a better approximation of a Turing machine than any inversion algorithm. Perhaps the only addition to my message below would be to permit the same form of editing of the description fields in 4645bis.

IOW, the rules would be:

1. Follow the source standard.
2. Split multiple names into multiple description fields.
3. Omit or edit problematic items, subject to consensus.

Addison

Mark Davis wrote:
I processed through the 01 file and generated the following data, listing every case where a Tag or Subtag contains a Description with a comma or "(". It is comma-delimited, so a useful way to view it is in Excel or OpenOffice, where you can filter and sort.

http://macchiato.com/ltru/ltru-commas.txt
(excel version http://macchiato.com/ltru/ltru-commas.xls )

My conclusions are that it would be possible to support option A (allow inverted names), but it would definitely require some considerable cleanup. The key is that the inversion has to be predictable, otherwise people can get garbage when they try to reconstruct the "real" name.

The algorithm to "de-invert" used the following. It assumes the Description is of the form:

   preComma ("," postComma) ("(" parenComment ")")?

When there is a comma outside of the parenComment, the inversion is formed as:

   postComma " " preComma // plus "(" parenComment ")" if found

The deinversion is listed after "=>" in the file. It doesn't work in many cases. I marked certain cases with "@" in column E that could be mechanically distinguished.

1. There are examples containing a double comma; those don't invert properly.

   Resian, Resianic, Rezijan

2. There is one case where a description contains nothing but comment. This should just be deparened, IMO.

   (alias for Hiragana + Katakana)

3. There are multiple duplicates.

   "Ainu (Japan)" and "Ainu"

   [The vast majority of the items with comments don't duplicate, so it is clearly more consistent not to.]

4. Sometimes the inverted form is *also* listed as a name; much more often it is not. This is also an inconsistency that should be fixed.

   Altai, Southern

5. There are multiple other failures of inversion, like the following. These cannot be detected automatically; they must be reviewed by knowledgeable people (column F).

   Natisone dialect, Nadiza dialect => Nadiza dialect Natisone dialect

6. There are missing inversions, which for consistency should be present.
   *Bulgarian, Old

   (These were done with a simple check, looking for Northern, North, ....) I'm sure that there are many more, but those would require experts to review.

Now, my preference is still to do option B -- I think A is inherently troublesome for translation -- but I wanted to give people the data I found.

What I think would also be useful, as a by-product of this, would be to move all of the parenComments into real Comment fields.

Mark

On 10/16/06, Kent Karlsson <kent.karlsson14 at comhem.se> wrote:

+1 as modified below (but needs wordsmithing).

> From: Addison Phillips [mailto:addison at yahoo-inc.com]
>
> I'd be happier with a rule for Doug and Michael to follow that is clear
> and unambiguous and has the required latitude---and then letting the
> problem take care of itself, as, so far, it has. Perhaps:
>
> --
> If more than one name is provided by the underlying standard for a new
> or updated record or the same name appears in multiple formats, the
> Language Subtag Reviewer MAY select which name or names appear in a
> Description field in the associated registry record (subject to
> community consensus) and MAY edit the names to correct spelling,
> orthographic, or formatting errors.

even if there is just one name given in the source registry

> The Reviewer MAY split a list of

in this case: "MAY" -> "SHOULD", or (preferably) even "SHALL"

> names into separate Description fields. Note: Additional Description
> fields MAY be added in various formats via the registration process.

/kent k

_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
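For readers who want to experiment with the de-inversion rule Mark describes above, here is a minimal Python sketch. It is not the code that produced ltru-commas.txt; it only assumes the stated shape, preComma ("," postComma) ("(" parenComment ")")?, with the comma part treated as optional. The function name `deinvert` and the flag strings are hypothetical and do not correspond to the "@" marks in column E of the file.

```python
import re

# Assumed shape, following the message:
#   preComma ("," postComma)? ("(" parenComment ")")?
_DESCRIPTION = re.compile(
    r"^(?P<body>[^()]*?)\s*(?:\((?P<paren>[^()]*)\))?\s*$"
)


def deinvert(description):
    """Return (text, flag). flag is None when the rule applied cleanly.

    The flag strings are illustrative only (hypothetical names).
    """
    match = _DESCRIPTION.match(description)
    if match is None:
        # e.g. nested parentheses or text after the closing ")"
        return description, "unparsed"

    body = match.group("body").strip()
    paren = match.group("paren")

    if not body:
        if paren is None:
            return description, "empty"
        # Description is nothing but a comment: just remove the parens.
        return paren.strip(), "comment-only"

    suffix = f" ({paren})" if paren is not None else ""
    parts = [part.strip() for part in body.split(",")]

    if len(parts) == 1:
        # No comma outside the comment: nothing to invert.
        return body + suffix, None
    if len(parts) > 2:
        # More than one comma: likely a list of names, leave untouched.
        return description, "multiple-commas"

    pre_comma, post_comma = parts
    return f"{post_comma} {pre_comma}{suffix}", None
```

With this sketch, `deinvert("Altai, Southern")` gives `("Southern Altai", None)`, while `deinvert("Natisone dialect, Nadiza dialect")` gives `("Nadiza dialect Natisone dialect", None)` with no flag at all. That is the kind of failure that, as noted above, cannot be detected automatically and needs human review.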
--
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.