Jehan
9c3c37517c
LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2
LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712
LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
6b2722885a
BuildLangModel: forgot to add charset/language files.
2015-12-12 18:18:08 +01:00
Jehan
fb3c47a073
LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is identical to TIS-620, except that it adds the
non-breaking space character.
Our detection will therefore always return TIS-620, except in the rare
cases where a text contains a non-breaking space.
2015-12-04 03:14:52 +01:00
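The TIS-620 / ISO-8859-11 distinction described in commit fb3c47a073 can be illustrated with a minimal sketch. It assumes the text is already known to be Thai, and the helper name `pick_thai_charset` is hypothetical. This is a simplification for illustration, not the detector's actual logic.

```python
NBSP = b"\xa0"  # non-breaking space in ISO-8859-11; unassigned in TIS-620


def pick_thai_charset(data: bytes) -> str:
    """Hypothetical helper: choose between the two Thai charsets.

    Every other byte means the same thing in both charsets, so the
    presence of 0xA0 is the only evidence that favours ISO-8859-11.
    """
    return "ISO-8859-11" if NBSP in data else "TIS-620"


# Encode a Thai word; these bytes are valid in both charsets.
thai = "สวัสดี".encode("iso8859_11")
plain_result = pick_thai_charset(thai)
nbsp_result = pick_thai_charset(thai + NBSP + thai)
```

Here `plain_result` falls back to TIS-620 because nothing in the bytes tells the two charsets apart, while `nbsp_result` is ISO-8859-11.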
Jehan
ffcd85f709
script: forgot to commit ISO-8859-9 and Turkish files.
2015-12-04 02:40:54 +01:00
Jehan
f0e122b506
LangModels: add Esperanto ISO-8859-3 language model.
2015-12-04 01:35:56 +01:00
Jehan
aa587a64bd
LangModels: adding German models for ISO-8859-1 and Windows-1252.
2015-12-03 23:58:41 +01:00
Jehan
0270b1e856
Adding French Windows-1252 support.
2015-12-03 21:22:30 +01:00
Jehan
192f8de165
BuildLangModel: build models with computed frequent characters count.
2015-11-30 00:04:44 +01:00
Jehan
429448199f
French language model: fix a start page.
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
b64831ff89
BuildLangModel: allow a list of start pages...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
7f290975ba
BuildLangModel: map different cases of the same character together.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer characters
(still in the main alphabet) to enter the frequent character list.
For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
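The effect of the case_mapping property from commit 7f290975ba can be sketched roughly as follows. The function name `count_characters` and its boolean parameter are assumptions made for illustration; the real script may structure this differently.

```python
from collections import Counter


def count_characters(text, case_mapping=False):
    """Count alphabetic characters, optionally merging upper and lower case."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue
        # With case mapping on, 'Œ' and 'œ' are tallied under one key,
        # so a rare letter has a better chance of reaching the
        # frequent character list.
        key = char.lower() if case_mapping else char
        counts[key] += 1
    return counts


sample = "Œuvre, cœur, Œil"
separate = count_characters(sample)
merged = count_characters(sample, case_mapping=True)
```

Without case mapping, the occurrences of 'Œ' and 'œ' are split across two entries; with it, they accumulate under the single key 'œ'.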
Jehan
00a78faa1d
BuildLangModel: the max_depth should be a script option...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
0314f98ece
BuildLangModel.py: some in-progress script to build language models.
2015-11-29 01:30:04 +01:00