Jehan
ea2f4dd40f
LangModels: new support for Latvian / ISO-8859-13.
...
Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs
2016-09-20 23:29:53 +02:00
Jehan
7cb3dd9ddd
LangModels: add support for Lithuanian / ISO-8859-13.
...
Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh .
2016-09-20 23:09:24 +02:00
Jehan
210e52d99a
LangModels: update the Greek language models.
...
I did this to improve the model after a user reported a badly detected
Greek subtitle (see commit e0eec3b).
It didn't help, but since I updated it with much more data from
Wikipedia, let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
198190461e
script: move the Wikipedia title syntax cleaning to BuildLangModel.py.
2016-02-21 16:20:22 +01:00
Jehan
d24bd7d578
script: Wikipedia API's python wrapper does not return garbage text anymore.
...
I can't see any new commits since 2014, so I am assuming the issue was on
Wikipedia's side and that it has been fixed.
2016-02-21 16:07:10 +01:00
Jehan
923d264470
LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
...
The test for ISO-8859-1 is disabled for now because, for the characters
used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not
big enough. Therefore the first to be declared "wins".
Let's improve this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
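To illustrate why the two charsets are hard to tell apart on Danish text, here is a minimal sketch (not part of the commit) that lists the byte values decoding differently under Python's built-in `iso8859_1` and `iso8859_15` codecs. The sample sentence is made up for the illustration.

```python
# Byte values that decode to different characters in the two charsets.
differing = [
    b for b in range(256)
    if bytes([b]).decode("iso8859_1") != bytes([b]).decode("iso8859_15")
]

for b in differing:
    latin1 = bytes([b]).decode("iso8859_1")
    latin9 = bytes([b]).decode("iso8859_15")
    print(f"0x{b:02X}: ISO-8859-1 {latin1!r} vs ISO-8859-15 {latin9!r}")

# Made-up Danish sample: the Danish letters encode to the same bytes
# in both charsets, so the byte stream alone cannot separate them.
danish = "Blåbærgrød på øen"
same_bytes = danish.encode("iso8859_1") == danish.encode("iso8859_15")
print("Danish sample encodes identically:", same_bytes)
```

Only a text using one of the differing bytes, such as the euro sign in ISO-8859-15, gives a detector something to work with.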
Jehan
98b5e52252
LangModels: add VISCII encoding support and retrain Vietnamese model.
2016-02-13 03:51:18 +01:00
Jehan
178c6119b8
LangModels: add Windows-1258 support for Vietnamese.
...
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not have any support for it, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
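A quick way to check whether a given Python installation knows a codec is sketched below. This is an illustration only, not the project's script; `has_codec` is a hypothetical helper name, and the result for "viscii" depends on the Python version and on any extra codec packages installed.

```python
import codecs

def has_codec(name):
    """Return True if this Python installation knows the named codec."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# Report availability for the two Vietnamese charsets discussed here.
for name in ("cp1258", "viscii"):
    print(name, "available" if has_codec(name) else "missing")
```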
Jehan
9c3c37517c
LangModels: add Arabic support.
...
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2
LangModels: retraining Greek models with my training script.
...
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712
LangModels: adding Spanish support.
...
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
6b2722885a
BuildLangModel: forgot to add charset/language files.
2015-12-12 18:18:08 +01:00
Jehan
fb3c47a073
LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
...
ISO-8859-11 is identical to TIS-620, except for the added
non-breaking space character.
Our detection will therefore always return TIS-620, except in the
rare cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
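The overlap between the two Thai charsets can be sketched with Python's built-in `iso8859_11` and `tis_620` codecs. This is an illustration under assumptions, not the detector's logic: `encodable` is a hypothetical helper, the sample word is arbitrary, and how Python's `tis_620` codec treats the non-breaking space is printed rather than assumed.

```python
def encodable(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Plain Thai text produces the same bytes in both charsets.
thai = "ภาษาไทย"
identical = thai.encode("iso8859_11") == thai.encode("tis_620")
print("Thai sample encodes identically:", identical)

# The non-breaking space (U+00A0) is the distinguishing character.
nbsp = "\u00a0"
for codec in ("iso8859_11", "tis_620"):
    print(codec, "can encode NBSP:", encodable(nbsp, codec))
```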
Jehan
ffcd85f709
script: forgot to commit ISO-8859-9 and Turkish files.
2015-12-04 02:40:54 +01:00
Jehan
f0e122b506
LangModels: add Esperanto ISO-8859-3 language model.
2015-12-04 01:35:56 +01:00
Jehan
aa587a64bd
LangModels: adding German models for ISO-8859-1 and Windows-1252.
2015-12-03 23:58:41 +01:00
Jehan
0270b1e856
Adding French Windows-1252 support.
2015-12-03 21:22:30 +01:00
Jehan
192f8de165
BuildLangModel: build models with computed frequent characters count.
2015-11-30 00:04:44 +01:00
Jehan
429448199f
French language model: fix a start page.
...
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
b64831ff89
BuildLangModel: allow a list of start pages...
...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
7f290975ba
BuildLangModel: map different cases of the same character together.
...
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer characters
(still in the main alphabet) to enter the frequent character list.
For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
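The effect of merging cases on character statistics can be sketched as follows. This is a simplified illustration, not the code from BuildLangModel.py: `count_characters` is a hypothetical function, and its `case_mapping` parameter only borrows the name of the language property mentioned above.

```python
from collections import Counter

def count_characters(text, case_mapping=True):
    """Count letters, optionally folding upper case into lower case."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip spaces, punctuation and digits
        counts[char.lower() if case_mapping else char] += 1
    return counts

sample = "Œuvre, œil, cœur, Œdipe"
merged = count_characters(sample, case_mapping=True)
separate = count_characters(sample, case_mapping=False)

print("merged:  ", merged["œ"])
print("separate:", separate["œ"], separate["Œ"])
```

With the cases merged, 'œ' accumulates the occurrences of both forms, which gives it a better chance of reaching the frequent character list.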
Jehan
00a78faa1d
BuildLangModel: the max_depth should be a script option...
...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
0314f98ece
BuildLangModel.py: some in-progress script to build language models.
2015-11-29 01:30:04 +01:00