Jehan 629bc879f3 script, src: add generic Korean model.
Until now, Korean charsets had their own probers, as there is no
single-byte encoding for writing Korean. I have now added a Korean model
used only for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language script has a lot of
characters (in which case a full sequence combination array would simply
be too much data). It's not perfect yet. For instance, our UTF-8 Korean
test file ends up with a confidence of 0.38503, which is low for
obviously Korean text. Still, it works (correctly detected, with the top
confidence compared to others) and is a first step toward further
improvement of the detection confidence.
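
To give a rough idea of why a full sequence combination array is too much data for Korean, here is a minimal sketch. It is not taken from script/BuildLangModel.py: the function names and the "keep only the most frequent characters" strategy are assumptions made for illustration. The only hard fact used is that precomposed Hangul syllables span U+AC00..U+D7A3 in Unicode.

```python
from collections import Counter

# Precomposed Hangul syllables occupy U+AC00..U+D7A3 in Unicode.
HANGUL_SYLLABLES = 0xD7A3 - 0xAC00 + 1


def full_matrix_cells(alphabet_size):
    """Number of cells in a full character-pair (sequence) table."""
    return alphabet_size ** 2


def frequent_pair_counts(text, top_n):
    """Count adjacent character pairs among the top_n most frequent characters.

    Hypothetical helper for illustration only; the real generation script
    may select characters and sequences differently.
    """
    # Rank non-whitespace characters by how often they occur.
    char_counts = Counter(ch for ch in text if not ch.isspace())
    frequent = {ch for ch, _ in char_counts.most_common(top_n)}

    # Only record pairs where both characters are in the frequent set,
    # so the table is bounded by top_n ** 2 instead of alphabet_size ** 2.
    pairs = Counter()
    for first, second in zip(text, text[1:]):
        if first in frequent and second in frequent:
            pairs[(first, second)] += 1
    return frequent, pairs


if __name__ == "__main__":
    print(HANGUL_SYLLABLES, full_matrix_cells(HANGUL_SYLLABLES))
    print(full_matrix_cells(256))
```

With 11172 syllables, a full pair table would need about 125 million cells, against 65536 for a 256-character single-byte charset, which is why some form of restriction to frequent characters is needed.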
2022-12-14 00:23:13 +01:00
..
ar.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
cs.py LangModels: add support for Czech. 2016-09-21 03:33:50 +02:00
da.py script, src, test: add IBM865 support for Danish. 2022-11-30 19:57:52 +01:00
de.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
el.py LangModels: update the Greek language models. 2016-05-25 17:39:10 +02:00
eo.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
es.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
et.py script: forgot to commit the Estonian description. 2016-09-27 00:51:19 +02:00
fi.py LangModels: add Finnish support. 2016-09-21 18:27:39 +02:00
fr.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
ga.py LangModels: added support for Irish Gaelic. 2016-09-27 00:49:05 +02:00
he.py script, src: generate the Hebrew models. 2022-12-14 00:23:13 +01:00
hr.py LangModels: new Croatian models. 2016-09-26 01:32:49 +02:00
hu.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
it.py LangModels: add Italian support. 2016-09-21 18:52:09 +02:00
ko.py script, src: add generic Korean model. 2022-12-14 00:23:13 +01:00
lt.py LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10. 2016-09-21 00:27:16 +02:00
lv.py LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10. 2016-09-21 00:27:16 +02:00
mt.py LangModels: support for Maltese / ISO-8859-3. 2016-09-21 02:11:31 +02:00
no.py Add Norwegian support 2022-11-30 19:09:09 +01:00
pl.py LangModels: add Polish support. 2016-09-21 17:30:15 +02:00
pt.py LangModels: add support for Portuguese / ISO-8859-1. 2016-09-21 00:01:07 +02:00
ro.py LangModels: Romanian support added. 2016-09-28 19:57:50 +02:00
sk.py script: language script for Slovak forgotten. 2016-09-21 18:58:12 +02:00
sl.py LangModels: add Slovene support. 2016-09-28 22:13:17 +02:00
sv.py LangModels: add Swedish support. 2016-09-28 22:42:13 +02:00
th.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
tr.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
vi.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00