
Re: [Ltru] Re: Test suite for language tags?



BTW, I have updated my regex to match the final spec for RFC 4646. Here is a single Perl or Java regex that does most of the parsing:

Regex: ((?: [a-z A-Z]{2,3} (?: [-] [a-z A-Z]{3} ){0,3} | [a-z A-Z]{4,8} ))(?: [-] ((?: [a-z A-Z]{4} )) )?(?: [-] ((?: [a-z A-Z]{2} | [0-9]{3} )) )?(?: [-] ((?: (?: [0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} ) (?: [-] (?: [0-9] [a-z A-Z 0-9]{3} | [a-z A-Z 0-9]{5,8} ) )* )) )?(?: [-] ((?: (?: [a-w y-z A-W Y-Z] (?: [-] [a-z A-Z 0-9]{2,8} )+ ) (?: [-] (?: [a-w y-z A-W Y-Z] (?: [-] [a-z A-Z 0-9]{2,8} )+ ) )* )) )?(?: [-] ((?: [xX] (?: [-] [a-z A-Z 0-9]{1,8} )+ )) )?| ( (?i) art [-] lojban| cel [-] gaulish| en [-] (?: boont | GB [-] oed | scouse )| i [-] (?: ami | bnn | default | enochian | hak | klingon | lux | mingo | navajo | pwn | tao | tay | tsu )| no [-] (?: bok | nyn)| sgn [-] (?: BE [-] fr | BE [-] nl | CH [-] de)| zh [-] (?: cmn | cmn [-] Hans | cmn [-] Hant | gan | guoyu | hakka | min | min [-] nan | wuu | xiang | yue))| ((?: [xX] (?: [-] [a-z A-Z 0-9]{1,8} )+ ))
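
For anyone trying this from Java, here is a minimal sketch of how a spaced-out pattern like this can be compiled. It assumes the Pattern.COMMENTS flag, which tells Java to ignore whitespace in the pattern, and it uses only an abridged pattern (language, script, region), not the full regex above. The class and method names are hypothetical, made up for illustration. The character classes are written without inner spaces because Perl's /x, unlike Java's COMMENTS mode, keeps whitespace inside a bracketed class significant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CLDR: an abridged illustration only.
final class LangTagRegexSketch {
    // Abridged pattern covering language, optional script, optional region.
    // Pattern.COMMENTS lets the spaces below serve purely as layout.
    private static final Pattern ABRIDGED = Pattern.compile(
          "( [a-zA-Z]{2,3} (?: [-] [a-zA-Z]{3} ){0,3} | [a-zA-Z]{4,8} )" // group 1: language (+ extlangs)
        + "(?: [-] ( [a-zA-Z]{4} ) )?"                                   // group 2: script
        + "(?: [-] ( [a-zA-Z]{2} | [0-9]{3} ) )?",                       // group 3: region
        Pattern.COMMENTS);

    // Returns {language, script, region}; absent parts are null.
    // Returns null when the whole tag does not match the abridged pattern.
    static String[] parse(String tag) {
        Matcher m = ABRIDGED.matcher(tag);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

With this sketch, parse("en-Latn-US") yields the subtags en, Latn and US, while parse("de-DE") leaves the script slot null.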

The regex checks for the grandfathered tags explicitly, since otherwise too much cruft sneaks in. What you can't check in a regex is that each singleton extension appears only once. (In retrospect we could have allowed repeated singletons: we could have accepted en-a-bcdef-ghijk-b-123-a-lmnop as equivalent to the canonical form en-a-bcdef-ghijk-lmnop-b-123, but that's water under the bridge at this point.) Of course, I didn't put this together by hand. The table used to build it is much more readable, at

http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagRegex.txt

and a test file that includes strings mentioned on this list is at:

http://unicode.org/cldr/data/tools/java/org/unicode/cldr/util/data/langtagTest.txt
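
The singleton check that the regex cannot express is easy to do as a separate pass. The following is only a sketch under stated assumptions: the tag has already matched the regex, singletons are compared case-insensitively, and everything from the private-use "x" onward is left alone. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical post-regex check, not taken from the CLDR tools.
final class SingletonCheckSketch {
    // Returns true when no extension singleton occurs more than once.
    // Assumes the tag is already well-formed according to the regex.
    static boolean hasUniqueSingletons(String tag) {
        String[] subtags = tag.toLowerCase(Locale.ROOT).split("-");
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < subtags.length; i++) {
            String s = subtags[i];
            if (s.equals("x")) {
                break; // private use: subtags after "x" are opaque
            }
            // After the first subtag, a one-character subtag is a singleton.
            if (i > 0 && s.length() == 1 && !seen.add(s)) {
                return false; // the same singleton appeared twice
            }
        }
        return true;
    }
}
```

Under these assumptions, en-a-bcdef-ghijk-b-123-a-lmnop is rejected because "a" appears twice, and en-a-bcdef-ghijk-lmnop-b-123 passes.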
Mark
_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www1.ietf.org/mailman/listinfo/ltru

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.