uchardet/script/BuildLangModelLogs
Jehan 629bc879f3 script, src: add generic Korean model.
Until now, Korean charsets had their own probers, as there is no
single-byte encoding for writing Korean. I have now added a Korean model
used only for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language's script has a lot
of characters (in which case a full sequence combination array would be
far too much data). It's not perfect yet. For instance, our UTF-8 Korean
test file ends up with a confidence of 0.38503, which is low for
obviously Korean text. Still, it works (the file is correctly detected,
with the top confidence compared to the other candidates) and is a first
step toward improving detection confidence.
2022-12-14 00:23:13 +01:00
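To illustrate why a full sequence combination array is impractical for a script with many characters, here is a minimal sketch of one possible approach: keep only the most frequent characters and store pair counts sparsely. This is not the code from script/BuildLangModel.py; the function name `build_sparse_stats`, the `max_chars` parameter and the coverage measure are all assumptions made for this illustration.

```python
from collections import Counter


def build_sparse_stats(text, max_chars=64):
    """Hypothetical helper: character and sequence statistics limited
    to the `max_chars` most frequent letters of `text`."""
    # Only letters are counted; whitespace, digits and punctuation are skipped.
    char_freq = Counter(c for c in text if c.isalpha())

    # Rank the most frequent letters: rank 0 is the most frequent one.
    order = {c: rank for rank, (c, _) in enumerate(char_freq.most_common(max_chars))}

    # Count adjacent letter pairs sparsely, keyed by (rank, rank), instead
    # of allocating a full N x N array over every character of the script.
    seq_freq = Counter()
    total_pairs = 0
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha():
            total_pairs += 1
            if a in order and b in order:
                seq_freq[(order[a], order[b])] += 1

    # Share of all letter pairs that the retained characters account for.
    coverage = sum(seq_freq.values()) / total_pairs if total_pairs else 0.0
    return order, seq_freq, coverage


if __name__ == "__main__":
    order, seq_freq, coverage = build_sparse_stats("한국어 한국어 텍스트", max_chars=4)
    print(len(order), len(seq_freq), round(coverage, 3))
```

With thousands of distinct Hangul syllables, a dense table of every pair would need millions of cells, while a sparse counter over the retained characters only stores the pairs that actually occur in the training text.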
LangArabicModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangCroatianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangCzechModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangDanishModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangEsperantoModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangEstonianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangFinnishModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangFrenchModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangGermanModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangGreekModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangHebrewModel.log script, src: generate the Hebrew models. 2022-12-14 00:23:13 +01:00
LangHungarianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangIrishModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangItalianModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangKoreanModel.log script, src: add generic Korean model. 2022-12-14 00:23:13 +01:00
LangLatvianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangLithuanianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangMalteseModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangPolishModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangPortugueseModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangRomanianModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangSlovakModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangSloveneModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangSpanishModel.log Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
LangSwedishModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangThaiModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangTurkishModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00
LangVietnameseModel.log src, script: regenerate all existing language models. 2022-12-14 00:23:13 +01:00