6 Commits

Author SHA1 Message Date
Jehan
4e967c9e88 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
downstream project could easily get the English language name from the
XML/JSON files provided by the iso-codes project. That project also
makes it easy to localize the language name into other languages through
gettext (this is what we do in GIMP, for instance). I am not adding any
dependency though, and leave it to downstream projects to implement this.
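
As a rough illustration of the downstream lookup described above, here is a minimal sketch in Python. It assumes the iso-codes JSON layout (a top-level "639-2" list whose entries carry "alpha_2" and "name" keys); the inline sample data and the helper names `build_language_table` and `language_name` are hypothetical and not part of this project or of iso-codes.

```python
import json

# Small inline sample imitating the ASSUMED layout of the iso-codes JSON
# data. A real project would read the file shipped by iso-codes instead.
SAMPLE_ISO_639 = """
{"639-2": [
  {"alpha_2": "el", "alpha_3": "ell", "name": "Greek, Modern (1453-)"},
  {"alpha_2": "zh", "alpha_3": "zho", "name": "Chinese"},
  {"alpha_3": "ace", "name": "Achinese"}
]}
"""


def build_language_table(json_text):
    """Map 2-character ISO 639-1 codes to English language names."""
    entries = json.loads(json_text)["639-2"]
    # Entries without a 2-character code are skipped.
    return {e["alpha_2"]: e["name"] for e in entries if "alpha_2" in e}


def language_name(code, table):
    """Return the English name for a detected code, or None.

    `code` may be None, mirroring probers that cannot report a language.
    """
    if code is None:
        return None
    return table.get(code)


TABLE = build_language_table(SAMPLE_ISO_639)
print(language_name("zh", TABLE))
print(language_name(None, TABLE))
```

Localization would then be a matter of passing the English name through gettext with the iso-codes translation domain, which is left out of this sketch.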

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2020-04-23 18:39:49 +02:00
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a Greek subtitle
being badly detected (see commit e0eec3b).
It didn't help, but since the models are now trained on much more data
from Wikipedia, let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps.
2015-12-03 19:04:07 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
2106173546 Move all Single-Byte language models to a subdirectory.
2015-11-27 23:11:23 +01:00