On Tue, Dec 11, 2007 at 11:47:46PM +0100,
Frank Ellermann <nobody at xyzzy.claranet.de> wrote
a message of 17 lines which said:
> Internet Drafts need to be US-ASCII. Just pipe
> it through 4645bis.awk to get the UTF-8 version:
> <http://purl.net/xyzzy/home/ltru/4645bis.awk>
Or ncr2utf8.py, attached :-)
I tested the UTF-8 version of 4645bis and everything is OK. It is much
simpler now :-)
unicodechar = satisfy (\thechar ->
                  let c = (ord thechar) in
                  (c >= 0x21 && c <= 0x10ffff))
              <?> "Character"
#!/usr/bin/python

""" Converts a text file with hexadecimal Numeric Character References
(like &#x153;) to a UTF-8 file"""

import sys
import re

ncr = re.compile(r"&#x([0-9A-F]+);", re.IGNORECASE)
extension = re.compile(r"^(.*)\.([a-z0-9_-]+)$", re.IGNORECASE)

def convert(thematch):
    # Turn the hexadecimal digits of one reference into the character itself
    codepoint = long(thematch.group(1), 16)
    return unichr(codepoint)

for ifilename in sys.argv[1:]:
    print "Converting %s..." % ifilename
    # Build the output name: foo.txt -> foo-utf8.txt, foo -> foo-utf8
    match = extension.search (ifilename)
    if match:
        ext_ifile = match.group(2)
        ofilename = match.group(1) + "-utf8." + ext_ifile
    else:
        ofilename = ifilename + "-utf8"
    ifile = open(ifilename, "r")
    ofile = open(ofilename, "w")
    # The input must be pure US-ASCII; anything else raises UnicodeDecodeError
    data = unicode(ifile.read(), "ascii")
    udata = re.sub(ncr, convert, data)
    ifile.close()
    ofile.write(udata.encode("utf-8"))
    ofile.close()
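
The script above is written for Python 2 (print statement, unicode, unichr, long). For readers on Python 3, here is a minimal sketch of the same substitution step only. The function name ncr_to_text is hypothetical, and leaving surrogate or out-of-range references untouched is an assumption of this sketch, not something the attached script does.

```python
import re

# Hexadecimal numeric character references, e.g. &#x153;
NCR = re.compile(r"&#x([0-9A-F]+);", re.IGNORECASE)


def ncr_to_text(text):
    """Replace each hexadecimal NCR in text with the character it names."""
    def convert(match):
        codepoint = int(match.group(1), 16)
        # Assumption of this sketch: references that cannot be encoded as
        # UTF-8 (surrogates, values above U+10FFFF) are left as they are.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)
    return NCR.sub(convert, text)
```

To write the result out, something like open(ofilename, "w", encoding="utf-8").write(ncr_to_text(data)) should do, with data read from a file opened with encoding="ascii".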
_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www1.ietf.org/mailman/listinfo/ltru
Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.