4 Commits

Author SHA1 Message Date
Jehan
f8752f2b56 src, script: add concept of alphabet_mapping in language models.
This allows to handle cases where some characters are actually
alternative/variants of another. For instance, a same word can be
written with both variants, while both are considered correct and
equivalent. Browsing a bit Slovenian Wikipedia, it looks like they only
use them for titles there.

I use this the first time on characters with diacritics in Slovene.
Indeed these are so rarely used that they would hardly show in the stats
and worse, any sequence using these in tested text would likely show as
negative sequences hence drop the confidence in Slovenian. As a
consequence, various Slovene text would show up as Slovak as it's close
enough and contains the same character with diacritics in a common way.
2021-03-21 15:54:24 +01:00
Jehan
5fe9a7e1df script: regenerate Slovak and Slovene with better alphabet support.
I was missing some characters, especially in the Slovak alphabet.
Oppositely the Slovene alphabet does not use 4 of the common ASCII
alphabet.
2021-03-21 13:30:41 +01:00
Jehan
5c3a2e8037 src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported models which had single-byte encoding support until
now.
2021-03-17 02:07:17 +01:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00