mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-06 08:46:40 +08:00
script: character orders in single-byte language models should be capped.
This happened when building a Croatian model, a language that can be written with many different encodings. These encodings also contain many irrelevant glyphs (i.e. glyphs used in other languages), so we ended up with orders over 255, which break when converted to unsigned char. Let's just make sure that we stay below the 250 limit (values from there up are used for controls, illegal characters, symbols, numbers…). This means several characters may share order 249, but since orders beyond the frequent character list don't matter, this is not a problem.
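To illustrate the failure described above, here is a minimal Python sketch, not code from the script itself. The helper names `clamp_order` and `as_unsigned_char` and the sample order values are assumptions made for this illustration; only the 249 cap and the unsigned char conversion come from the commit message.

```python
# Hypothetical helpers for illustration; they are not part of the script.

MAX_ORDER = 249  # cap used by this commit; higher values are reserved


def as_unsigned_char(value):
    """Mimic a C conversion to unsigned char by keeping the low 8 bits."""
    return value % 256


def clamp_order(order, max_order=MAX_ORDER):
    """Cap a character order so it never reaches the reserved range."""
    return min(max_order, order)


# Made-up sample orders, some of them past the reserved range and past 255.
orders = [0, 17, 249, 250, 256, 300]

# Without the cap, large orders wrap around or collide with reserved values.
wrapped = [as_unsigned_char(o) for o in orders]

# With the cap, every out-of-range order collapses to 249.
clamped = [clamp_order(o) for o in orders]

print(wrapped)
print(clamped)
```

Without the cap, an order of 256 would wrap to 0 and be read as the most frequent character; with the cap, it simply becomes one more character sharing order 249.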
parent 05ba8555cd
commit d76d33b88b
@@ -414,10 +414,18 @@ for charset in charsets:
             uchar = local_lowercase(uchar, lang)
             for order, (char, ratio) in enumerate(sorted_ratios):
                 if char == ord(uchar):
-                    CTOM_str += '{:3},'.format(order)
+                    CTOM_str += '{:3},'.format(min(249, order))
                     break
             else:
-                CTOM_str += '{:3},'.format(n_char)
+                # XXX: we must make sure the character order does not go
+                # over the special characters (250 currently). This may
+                # actually happen when building a model for a language
+                # writable with many different encoding. So let's just
+                # ceil the order value at 249 max.
+                # It may be an interesting alternative to add another
+                # constant for any character with an order > freqCharCount.
+                # Maybe IRR (irrelevant character) or simply CHR.
+                CTOM_str += '{:3},'.format(min(249, n_char))
                 n_char += 1
         CTOM_str += ' /* {:X}X */'.format(line)
     CTOM_str += '\n};\n/*'
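The alternative floated in the code comment, a dedicated constant for characters outside the frequent character list, could look roughly like the sketch below. This is not what the commit implements. The value chosen for `IRR`, the function name `order_or_irr`, and the assumption that orders are zero-based (so an order equal to the count already falls outside the list) are all hypothetical.

```python
# Hypothetical sketch of the "IRR" idea from the comment; not in the commit.

IRR = 249  # assumed marker value, picked only for this illustration


def order_or_irr(order, freq_char_count, irr=IRR):
    """Return the order for frequent characters, else the IRR marker.

    Assumes zero-based orders, so valid frequent orders are
    0 .. freq_char_count - 1.
    """
    if freq_char_count > irr:
        # The marker must not collide with a genuine frequent order.
        raise ValueError("freq_char_count must not exceed the IRR value")
    return order if order < freq_char_count else irr


print(order_or_irr(10, 64))   # frequent character keeps its order
print(order_or_irr(300, 64))  # irrelevant character maps to the marker
```

Compared with `min(249, order)`, an explicit marker would make it obvious in the generated table which characters fall outside the frequent list, at the cost of one more reserved value.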