[Ltru] Inversions and other problems



I processed the 01 file and generated the following data, listing every case where a Tag or Subtag has a Description containing a comma or "(". The output is comma-delimited, so a useful way to view it is in Excel or OpenOffice, where you can filter and sort.

http://macchiato.com/ltru/ltru-commas.txt
(excel version http://macchiato.com/ltru/ltru-commas.xls )

My conclusion is that it would be possible to support option A (allow inverted names), but it would definitely require considerable cleanup. The key is that the inversion has to be predictable; otherwise people can get garbage when they try to reconstruct the "real" name.

The "de-invert" algorithm works as follows. It assumes the Description has the form:

preComma ("," postComma)? ("(" parenComment ")")?

When there is a comma outside the parenComment, the de-inverted form is:

postComma " " preComma                 // plus "(" parenComment ")" if found
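
As a rough illustration, the rule above can be sketched in Python. This is only a minimal sketch under the stated grammar, not the code used to produce the file; the function name deinvert and the regular expression are assumptions made for the example.

```python
import re

# Assumed shape: preComma ("," postComma)? ("(" parenComment ")")?
# The body may not contain parentheses; the comment, if any, must come last.
_DESCRIPTION = re.compile(r"^(?P<body>[^()]*?)\s*(?:\((?P<paren>[^()]*)\))?$")


def deinvert(description):
    """Return the de-inverted form of a Description, or the input unchanged
    if it does not fit the assumed shape."""
    m = _DESCRIPTION.match(description.strip())
    if m is None:
        return description
    body, paren = m.group("body"), m.group("paren")
    # Split at the FIRST comma only; this is why double commas come out wrong.
    pre, comma, post = body.partition(",")
    if comma:
        body = f"{post.strip()} {pre.strip()}"
    if paren is not None:
        # Re-attach the parenthesized comment after the reordered name.
        body = f"{body} ({paren})".strip()
    return body


print(deinvert("Altai, Southern"))
print(deinvert("Natisone dialect, Nadiza dialect"))
```

Because the split is purely mechanical, a list of alternative names separated by a comma is reordered just as an inverted name would be, which is the kind of failure described in the numbered cases below.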

The de-inversion is listed after "=>" in the file. It doesn't work in many cases.
In column E, I marked with "@" the cases that could be mechanically distinguished.

1. There are examples containing a double comma; those don't invert properly
Resian, Resianic, Rezijan

2. There is one case where a description contains nothing but comment. This should just be deparened, IMO.
(alias for Hiragana + Katakana)

3. There are multiple duplicates.
"Ainu (Japan)" and "Ainu"

[The vast majority of the items with comments don't duplicate, so it is clearly more consistent not to.]

4. Sometimes the inverted form is *also* listed as a name; much more often it is not. This is also an inconsistency that should be fixed.
Altai, Southern

5. There are multiple other failures of inversion, like the following. These cannot be detected automatically; they must be reviewed by knowledgeable people (column F).

Natisone dialect, Nadiza dialect => Nadiza dialect Natisone dialect


6. There are missing inversions, which for consistency should be present.
*Bulgarian, Old

(These were done with a simple check, looking for Northern, North, ....) I'm sure that there are many more, but those would require experts to review.
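
For illustration, the mechanically detectable cases (1, 2 and 6) could be flagged along these lines. This is a hedged sketch, not the check that was actually run: the function name flag, the flag labels, and the QUALIFIERS tuple are assumptions, and the tuple holds only the words mentioned in this message, since the full list is not given.

```python
# Assumed qualifier words; the full list used for the real check is not given.
QUALIFIERS = ("Northern", "North", "Old")


def flag(description):
    """Return a list of labels for problems that can be spotted mechanically."""
    flags = []
    text = description.strip()
    # Only look at the part before any parenthesized comment.
    outside = text.split("(", 1)[0]
    if outside.count(",") >= 2:
        flags.append("double-comma")          # case 1
    if text.startswith("(") and text.endswith(")"):
        flags.append("comment-only")          # case 2
    first_word = outside.split(" ", 1)[0]
    if "," not in outside and first_word in QUALIFIERS and " " in outside.strip():
        # An uninverted name such as "Old Bulgarian" with no comma form.
        flags.append("possibly-missing-inversion")  # case 6
    return flags


for d in ("Resian, Resianic, Rezijan",
          "(alias for Hiragana + Katakana)",
          "Old Bulgarian",
          "Altai, Southern"):
    print(d, "=>", flag(d))
```

A check like this only catches names that start with a listed word; anything else still needs review by people who know the languages.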

Now, my preference is still to do option B -- I think A is inherently troublesome for translation -- but I wanted to give people the data I found.

What I think would also be useful, as a by-product of this, would be to move all of the parenComments into real Comment fields.

Mark

On 10/16/06, Kent Karlsson <kent.karlsson14 at comhem.se> wrote:

+1 as modified below (but needs wordsmithing).

> From: Addison Phillips [mailto:addison at yahoo-inc.com]

> I'd be happier with a rule for Doug and Michael to follow that is clear
> and unambiguous and has the required latitude---and then letting the
> problem take care of itself, as, so far, it has. Perhaps:
>
> --
> If more than one name is provided by the underlying standard for a
> new or updated record or the same name appears in multiple formats, the
> Language Subtag Reviewer MAY select which name or names appear in a
> Description field in the associated registry record (subject to
> community consensus) and MAY edit the names to correct spelling,
> orthographic, or formatting errors.

even if there is just one name given in the source registry

> The Reviewer MAY split a list of

in this case: "MAY" -> "SHOULD", or (preferably) even "SHALL"

> names into separate Description fields. Note: Additional Description
> fields MAY be added in various formats via the registration process.

                /kent k


_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www1.ietf.org/mailman/listinfo/ltru

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.