Jehan f8752f2b56 src, script: add concept of alphabet_mapping in language models.
This makes it possible to handle cases where some characters are actually
alternatives or variants of another. For instance, the same word can be
written with either variant, and both spellings are considered correct and
equivalent. Browsing Slovenian Wikipedia a bit, it looks like they only
use them for titles there.

I use this for the first time on characters with diacritics in Slovene.
Indeed, these are so rarely used that they would hardly show in the stats
and, worse, any sequence using them in tested text would likely show up as
a negative sequence and hence drop the confidence in Slovenian. As a
consequence, various Slovene texts would show up as Slovak, as it is close
enough and commonly uses the same characters with diacritics.
2021-03-21 15:54:24 +01:00
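The idea described in the commit message above can be sketched roughly as follows. This is only an illustration, not the actual contents of sl.py or BuildLangModel.py: the names ALPHABET_MAPPING, normalize and char_frequencies are hypothetical, and the handful of accented vowels in the mapping are an assumed sample of variants rather than the list the project uses. The point is that variant characters are folded onto their base character before statistics are gathered, so rare variants neither get lost in the stats nor produce unseen sequences.

```python
from collections import Counter

# Hypothetical mapping from rarely used variant characters to the base
# character they are considered equivalent to (illustrative sample only).
ALPHABET_MAPPING = {
    'á': 'a', 'à': 'a',
    'é': 'e', 'è': 'e', 'ê': 'e',
    'í': 'i', 'ì': 'i',
    'ó': 'o', 'ò': 'o', 'ô': 'o',
    'ú': 'u', 'ù': 'u',
}


def normalize(text, mapping):
    """Lowercase the text and replace each variant by its base character."""
    return ''.join(mapping.get(ch, ch) for ch in text.lower())


def char_frequencies(text, mapping):
    """Count letters after folding variants onto their base characters."""
    return Counter(ch for ch in normalize(text, mapping) if ch.isalpha())


if __name__ == '__main__':
    sample = 'Médved in medved'
    print(normalize(sample, ALPHABET_MAPPING))
    print(char_frequencies(sample, ALPHABET_MAPPING).most_common(3))
```

With such a mapping, both spellings of a word contribute to the same character and sequence counts, which is the effect the commit message describes for Slovene.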
..
ar.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
cs.py LangModels: add support for Czech. 2016-09-21 03:33:50 +02:00
da.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
de.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
el.py LangModels: update the Greek language models. 2016-05-25 17:39:10 +02:00
eo.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
es.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
et.py script: forgot to commit the Estonian description. 2016-09-27 00:51:19 +02:00
fi.py LangModels: add Finnish support. 2016-09-21 18:27:39 +02:00
fr.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
ga.py LangModels: added support for Irish Gaelic. 2016-09-27 00:49:05 +02:00
he.py script, src: generate the Hebrew models. 2021-03-17 23:22:50 +01:00
hi.py src: add Hindi/UTF-8 support. 2021-03-19 22:36:30 +01:00
hr.py LangModels: new Croatian models. 2016-09-26 01:32:49 +02:00
hu.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
it.py LangModels: add Italian support. 2016-09-21 18:52:09 +02:00
ko.py script, src: remove generated statistics data for Korean. 2021-03-20 22:59:52 +01:00
lt.py LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10. 2016-09-21 00:27:16 +02:00
lv.py LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10. 2016-09-21 00:27:16 +02:00
mt.py LangModels: support for Maltese / ISO-8859-3. 2016-09-21 02:11:31 +02:00
pl.py LangModels: add Polish support. 2016-09-21 17:30:15 +02:00
pt.py LangModels: add support for Portuguese / ISO-8859-1. 2016-09-21 00:01:07 +02:00
ro.py LangModels: Romanian support added. 2016-09-28 19:57:50 +02:00
sk.py script: regenerate Slovak and Slovene with better alphabet support. 2021-03-21 13:30:41 +01:00
sl.py src, script: add concept of alphabet_mapping in language models. 2021-03-21 15:54:24 +01:00
sv.py LangModels: add Swedish support. 2016-09-28 22:42:13 +02:00
th.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
tr.py script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
vi.py script, src: regenerate the Vietnamese model. 2021-03-21 01:17:55 +01:00