10 Commits

Author SHA1 Message Date
Jehan
db836fad63 script, src: generate more code for language and sequence model listing.
Right now, each time we add new language or new charset support, we have
too many pieces of code not to forget to edit. The script
script/BuildLangModel.py will now take care of the main parts: listing
the sequence models, listing the generic language models and computing
the numbers for each listing.

Furthermore the script will now end with a TODO list of the parts which
are still to be done manually (2 functions to edit and a CMakeLists).

Finally the script now allows to give a list of languages to edit rather
of having to run it with languages one by one. It also allows 2 special
code: "none", which will retrain none of the languages, but will
re-generate only the new generated listings; and "all" which will
retrain all models (useful in particulare when we change the model
formats or usage and want to regenerate everything).
2022-12-18 17:23:34 +01:00
Jehan
e6e51d9fe8 src: all language models now rebuilt after the fix. 2022-12-15 14:31:55 +01:00
Jehan
6bb1b3e101 scripts: all language models rebuilt with the new ratio data. 2022-12-14 20:16:44 +01:00
Jehan
eb8308d50a src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported models which had single-byte encoding support until
now.
2022-12-14 00:23:13 +01:00
Jehan
5a949265d5 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2022-12-14 00:23:13 +01:00
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a Greek sutitle
badly detected (see commit e0eec3b).
It didn't help, but well... since I updated it with much more data from
Wikipedia. Let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. 2015-12-03 19:04:07 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
2106173546 Move all Single-Byte language models to a subdirectory. 2015-11-27 23:11:23 +01:00