mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-06 16:56:40 +08:00
Until now, Korean charsets had their own probers, as there is no single-byte encoding for writing Korean. I have now added a Korean model used only for the generic character and sequence statistics. I also improved the generation script (script/BuildLangModel.py) to allow for languages without single-byte charset generation and to provide meaningful statistics even when the language's script has so many characters that a full sequence combination array would be too much data. It's not perfect yet. For instance, our UTF-8 Korean test file ends up with a confidence of 0.38503, which is low for obviously Korean text. Still, it works (correctly detected, with the top confidence compared to others) and is a first step toward improving detection confidence.

| File |
|---|
| LangArabicModel.log |
| LangCroatianModel.log |
| LangCzechModel.log |
| LangDanishModel.log |
| LangEsperantoModel.log |
| LangEstonianModel.log |
| LangFinnishModel.log |
| LangFrenchModel.log |
| LangGermanModel.log |
| LangGreekModel.log |
| LangHebrewModel.log |
| LangHungarianModel.log |
| LangIrishModel.log |
| LangItalianModel.log |
| LangKoreanModel.log |
| LangLatvianModel.log |
| LangLithuanianModel.log |
| LangMalteseModel.log |
| LangPolishModel.log |
| LangPortugueseModel.log |
| LangRomanianModel.log |
| LangSlovakModel.log |
| LangSloveneModel.log |
| LangSpanishModel.log |
| LangSwedishModel.log |
| LangThaiModel.log |
| LangTurkishModel.log |
| LangVietnameseModel.log |
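
The commit message above mentions that a full sequence combination array is impractical when a script has a lot of characters. As a rough illustration of that idea only, here is a minimal sketch that ranks characters by frequency and counts two-character sequences only among the most frequent ones, storing them sparsely. It is not the code of script/BuildLangModel.py; the function name `build_sequence_stats`, the `max_chars` parameter, and the coverage ratio are all hypothetical choices made for this example.

```python
from collections import Counter


def build_sequence_stats(text, max_chars=64):
    """Hypothetical sketch: bigram counts restricted to frequent characters."""
    # Rank non-space characters by frequency; order 0 is the most frequent.
    char_counts = Counter(ch for ch in text if not ch.isspace())
    order = {
        ch: rank
        for rank, (ch, _) in enumerate(char_counts.most_common(max_chars))
    }

    # Sparse counts keyed by (order, order) instead of a dense N x N array.
    bigrams = Counter()
    total = 0
    for a, b in zip(text, text[1:]):
        if a.isspace() or b.isspace():
            continue  # sequences spanning whitespace are ignored here
        total += 1
        if a in order and b in order:
            bigrams[(order[a], order[b])] += 1

    # Share of all observed sequences that the kept characters account for.
    covered = sum(bigrams.values()) / total if total else 0.0
    return order, bigrams, covered


if __name__ == "__main__":
    order, bigrams, covered = build_sequence_stats("한국어 한국어 한국", max_chars=2)
    print(len(order), sum(bigrams.values()), covered)
```

The coverage ratio gives a quick sense of how much sequence information is lost by capping the character set; a real model generator would likely need further normalization that this sketch does not attempt.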