Jehan
178c6119b8
LangModels: add Windows-1258 support for Vietnamese.
...
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate the proper
statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
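The VISCII limitation described in the Windows-1258 commit above can be checked directly from Python. The following is a minimal sketch, assuming a stock CPython 3 installation with no third-party codecs registered; `has_codec` is a hypothetical helper written for this illustration and is not part of the training script.

```python
import codecs


def has_codec(name: str) -> bool:
    """Return True if this Python installation knows the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup() raises LookupError for unknown encodings.
        return False
    return True


# "cp1258" is Python's name for Windows-1258; "viscii" is the name
# one would expect for VISCII if the standard library shipped it.
for charset in ("cp1258", "viscii"):
    print(charset, has_codec(charset))
```

A script that relies on `str.encode()` to produce training bytes can only cover charsets for which this lookup succeeds.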
Jehan
9c3c37517c
LangModels: add Arabic support.
...
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2
LangModels: retraining Greek models with my training script.
...
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712
LangModels: adding Spanish support.
...
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
a251753db8
LangModels: updating Hungarian language models.
2015-12-12 18:06:17 +01:00
Jehan
5691dc59a1
LangModels: rename Cyrillic models to Russian models.
...
Our language models are per-language, not per-script.
2015-12-04 03:27:29 +01:00
Jehan
fb3c47a073
LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
...
ISO-8859-11 is identical to TIS-620 except for the added non-breaking
space character.
Our detection will therefore return TIS-620 unless the text contains a
non-breaking space.
2015-12-04 03:14:52 +01:00
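The relationship between the two Thai charsets in the commit above can be illustrated in Python. This is only a sketch of the idea, not the detector's actual logic: `thai_charset_for` is a hypothetical helper that treats the byte 0xA0 (the non-breaking space in ISO-8859-11) as the sole distinguishing signal, and the sample string is an arbitrary Thai phrase.

```python
THAI_SAMPLE = "ภาษาไทย"   # arbitrary Thai text used for illustration
NBSP = "\u00a0"             # non-breaking space, byte 0xA0 in ISO-8859-11


def thai_charset_for(data: bytes) -> str:
    """Name the narrowest Thai single-byte charset that covers `data`.

    Assumes `data` is already known to be Thai text in one of the two
    charsets; only the presence of byte 0xA0 is examined.
    """
    return "ISO-8859-11" if b"\xa0" in data else "TIS-620"


# Thai letters map to the same bytes in both charsets.
plain = THAI_SAMPLE.encode("iso8859_11")
same_in_tis = THAI_SAMPLE.encode("tis_620") == plain

# Adding a non-breaking space introduces the one differing byte.
with_nbsp = (THAI_SAMPLE + NBSP + THAI_SAMPLE).encode("iso8859_11")

print(same_in_tis)
print(thai_charset_for(plain))
print(thai_charset_for(with_nbsp))
```

Because every other byte is shared, a text without a non-breaking space is valid in both charsets, which is why TIS-620 is the expected answer in the common case.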
Jehan
5ee1c3ee39
LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.
2015-12-04 02:35:09 +01:00
Jehan
f0e122b506
LangModels: add Esperanto ISO-8859-3 language model.
2015-12-04 01:35:56 +01:00
Jehan
aa587a64bd
LangModels: adding German models for ISO-8859-1 and Windows-1252.
2015-12-03 23:58:41 +01:00
Jehan
0270b1e856
Adding French Windows-1252 support.
2015-12-03 21:22:30 +01:00
Jehan
d686fcc1cd
LangModels: add illegal codepoints information on single byte charmaps.
2015-12-03 19:04:07 +01:00
Jehan
9cb5764b73
LangModels: update the French language models.
...
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dbb4c1d2ff
nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
005fd98086
Add initial support for French with ISO-8859-1 and ISO-8859-15.
...
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
2106173546
Move all Single-Byte language models to a subdirectory.
2015-11-27 23:11:23 +01:00