2 Commits

Author SHA1 Message Date
Jehan
b7acffc806 script, src: remove generated statistics data for Korean. 2022-12-14 00:24:53 +01:00
Jehan
629bc879f3 script, src: add generic Korean model.
Until now, Korean charsets had its own probers as there are no
single-byte encoding for writing Korean. I now added a Korean model only
for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language script has a lot of
characters (so we can't have a full sequence combination array, just too
much data). It's not perfect yet. For instance our UTF-8 Korean test
file ends up with confidence of 0.38503, which is low for obvious Korean
text. Still it works (correctly detected, with top confidence compared
to others) and is a first step toward more improvement for detection
confidence.
2022-12-14 00:23:13 +01:00