script: add error handling for when iconv fails to convert a codepoint.

This can happen when our character set table is wrong, but also when
iconv itself has a bug or an incomplete charset table. For instance, I
was trying to implement IBM880 for #29, but iconv was missing a few
codepoints: it seems to think that 0x45 (є), 0x55 (ў) and 0x74 (Ў) are
illegal in IBM880 (and possibly in other charsets), whereas the
information we have seems to say they are valid.
Python does not support this character set at all.

This test will help discover the issue earlier, rather than breaking
a few lines later because `iconv` failed and returned an empty string,
making ord() fail with a TypeError exception.
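
A minimal sketch of the failure mode described above, not the script's
actual code: `check_converted` is a hypothetical helper name, and the empty
string stands in for what a failed `iconv` call hands back. It raises an
exception instead of printing and exiting, purely to keep the sketch short.

```python
def check_converted(uchar, cp, charset):
    """Return uchar unchanged, or fail early if the conversion came back empty.

    Hypothetical helper for illustration; the commit itself prints a
    message and calls exit(1) instead of raising.
    """
    if len(uchar) == 0:
        raise ValueError(
            'iconv failed to return a unicode character for codepoint '
            '"{}" in charset {}.'.format(hex(cp), charset))
    return uchar


# Without the guard, the empty result only blows up later, in ord(),
# with an error that says nothing about the codepoint or the charset.
try:
    ord('')
except TypeError as e:
    print('late failure:', e)

# With the guard, the error names the offending codepoint and charset.
try:
    check_converted('', 0x45, 'IBM880')
except ValueError as e:
    print('early failure:', e)
```

The point of failing at the conversion step is that the message can carry
the codepoint and charset, which the later ord() error cannot.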

See: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/29#note_1691847
This commit is contained in:
Jehan 2022-12-17 18:00:22 +01:00
parent 6d31689632
commit 5e25e93da7

@@ -537,6 +537,9 @@ for charset in charsets:
except FileNotFoundError:
print('Error: "{}" is not a supported charset by python and `iconv` is not installed.\n')
exit(1)
if len(uchar) == 0:
print('TypeError: iconv failed to return a unicode character for codepoint "{}" in charset {}.\n'.format(hex(cp), charset))
exit(1)
#if lang.case_mapping and uchar.isupper() and \
#len(unicodedata.normalize('NFC', uchar.lower())) == 1:
# Unless we encounter special cases of characters with no