40 Commits

Author SHA1 Message Date
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a Greek subtitle
that was badly detected (see commit e0eec3b).
It didn't help, but since I updated the model with much more data from
Wikipedia, let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
6cd8c322ad script: stupid bug on BuildLangModel.py. 2016-05-25 15:23:36 +02:00
Jehan
198190461e script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
Jehan
d24bd7d578 script: Wikipedia API's python wrapper does not return garbage text anymore.
I can't see any new commits since 2014, so I am assuming the issue was on
Wikipedia's side and that it has been fixed.
2016-02-21 16:07:10 +01:00
Jehan
37024460fe script: add a README file dedicated to adding new support. 2016-02-21 16:06:11 +01:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
The test for ISO-8859-1 is disabled for now since, for the characters
used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not
big enough. Therefore the first to be declared "wins".
Let's see about improving this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
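To illustrate why the two charsets are hard to tell apart on Danish text, here is a minimal sketch that lists the byte values at which Python's own ISO-8859-1 and ISO-8859-15 codecs disagree. The helper name `differing_bytes` is hypothetical and not part of the repository.

```python
def differing_bytes(charset_a, charset_b):
    """Return the byte values that decode differently in the two charsets."""
    return [b for b in range(256)
            if bytes([b]).decode(charset_a) != bytes([b]).decode(charset_b)]


# Only eight code points differ between the two charsets.
diff = differing_bytes('iso-8859-1', 'iso-8859-15')
print([hex(b) for b in diff])
# → ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']

# The Danish-specific letters encode to the same bytes in both charsets.
danish_letters = 'æøåÆØÅ'
print(danish_letters.encode('iso-8859-1') == danish_letters.encode('iso-8859-15'))
# → True
```

Since none of the differing bytes is a Danish letter, a text without, say, a euro sign gives a detector nothing to distinguish the two charsets by.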
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
600cf76a76 BuildLangModel: try using iconv for conversion when support missing...
... in Python. For instance I had the case where the VISCII encoding is
supported by iconv but not by the encode/decode() functions in core Python.
2016-02-13 03:47:41 +01:00
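A minimal sketch of such a fallback, assuming an `iconv` binary is available on the PATH. The function names `python_supports` and `encode_text` are hypothetical; the actual script may structure this differently.

```python
import codecs
import subprocess


def python_supports(charset):
    """Return True if core Python ships a codec for this charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


def encode_text(text, charset):
    """Encode text with Python when possible, else through external iconv."""
    if python_supports(charset):
        return text.encode(charset)
    # Assumption: the iconv command-line tool is installed and knows the
    # charset; -f and -t select the source and target encodings.
    result = subprocess.run(['iconv', '-f', 'UTF-8', '-t', charset],
                            input=text.encode('utf-8'),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

With `check=True`, a charset that iconv does not know either surfaces as a `subprocess.CalledProcessError` instead of silently producing empty data.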
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not have any support for it, so I cannot generate
the proper statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
Jehan
27135a8880 BuildLangModel: printing a message when discarding a page. 2016-02-13 02:27:15 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
055332ac7d BuildLangModel: allow the alphabet list to be written in string format. 2015-12-12 18:50:29 +01:00
Jehan
6b2722885a BuildLangModel: forgot to add charset/language files. 2015-12-12 18:18:08 +01:00
Jehan
7b4eb9827e BuildLangModel: add an exception handler on charset spec errors. 2015-12-12 18:00:30 +01:00
Jehan
569509f844 BuildLangModel: forgot to add logs for Thai models generation. 2015-12-04 03:26:52 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is identical to TIS-620, except for the added non-breaking
space character.
So our detection will always return TIS-620, except in the exceptional
cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
ffcd85f709 script: forgot to commit ISO-8859-9 and Turkish files. 2015-12-04 02:40:54 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
22b9ed2d4f BuildLangModel: add concept of custom_case_mapping…
… for langs for which Python lower() algorithm fails.
In particular the Turkish dotted/dotless 'i' does not follow the same
rules as common western languages.
Lowercase for 'I' is indeed not 'i' but 'ı'.
Uppercase for 'i' is indeed not 'I' but 'İ'.
2015-12-04 02:29:40 +01:00
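A minimal sketch of how a per-language mapping could override Python's default lowercasing. The name `custom_case_mapping` comes from the commit message; the dictionary shape and the helper `lower_with_mapping` are assumptions made for illustration.

```python
# Assumed shape: uppercase character -> lowercase character for the language.
TURKISH_CASE_MAPPING = {'I': 'ı', 'İ': 'i'}


def lower_with_mapping(char, custom_case_mapping=None):
    """Lowercase one character, preferring the language-specific mapping."""
    if custom_case_mapping and char in custom_case_mapping:
        return custom_case_mapping[char]
    # Fall back to Python's locale-independent algorithm.
    return char.lower()


print(lower_with_mapping('I'))                        # → i
print(lower_with_mapping('I', TURKISH_CASE_MAPPING))  # → ı
print(lower_with_mapping('İ', TURKISH_CASE_MAPPING))  # → i
```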
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
a167bd5e42 BuildLangModel: lowercase only when resulting char has a composed form.
I had the case with the Turkish dotted 'İ', where lowercasing it with
Python's algorithm returned a decomposed character that could not be
recomposed. Therefore ord() raised a TypeError because the string
length was 2.
2015-12-04 01:30:21 +01:00
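The failure described here can be reproduced and guarded against roughly as follows. This is a sketch only: `safe_lower` is a hypothetical helper, and the actual script may handle the case differently.

```python
import unicodedata


def safe_lower(char):
    """Lowercase a character only if the result is a single code point."""
    lowered = unicodedata.normalize('NFC', char.lower())
    # 'İ'.lower() is 'i' followed by U+0307 (combining dot above), which
    # has no precomposed form, so the result stays two code points long.
    return lowered if len(lowered) == 1 else char


print(len('İ'.lower()))     # → 2
print(safe_lower('İ'))      # → İ
print(ord(safe_lower('É'))) # → 233
```

Keeping the original character when no composed lowercase form exists means `ord()` always receives a string of length 1.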
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
9cb5764b73 LangModels: update the French language models.
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dc5caa46bc BuildLangModel: fix hardcoded file names. 2015-11-30 19:18:25 +01:00
Jehan
3e5d37a6b5 BuildLangModel: process pages level per level.
I.e. horizontally or "breadth first" rather than vertical tree traversal.
This makes sure that all the start pages in particular are searched
when using the max_page option.
2015-11-30 19:12:04 +01:00
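A minimal sketch of level-by-level traversal with a page limit. The function `visit_breadth_first` and its `get_links` callback are hypothetical stand-ins for the script's Wikipedia queries; `max_depth` and `max_page` mirror the options named in these commits.

```python
from collections import deque


def visit_breadth_first(start_pages, get_links, max_depth, max_page=None):
    """Visit pages one level at a time, so every start page comes first."""
    visited = []
    seen = set()
    queue = deque()
    for title in start_pages:
        if title not in seen:
            seen.add(title)
            queue.append((title, 0))
    while queue:
        if max_page is not None and len(visited) >= max_page:
            break
        title, depth = queue.popleft()
        visited.append(title)
        if depth < max_depth:
            for link in get_links(title):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited


links = {'A': ['A1', 'A2'], 'B': ['B1'], 'A1': ['A11']}
print(visit_breadth_first(['A', 'B'], lambda t: links.get(t, []),
                          max_depth=2, max_page=3))
# → ['A', 'B', 'A1']
```

With a depth-first traversal and the same limit of three pages, the result would have been A, A1, A11, and start page B would never have been visited.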
Jehan
d9d347099e BuildLangModel: fix some minor comment from a previous spec. 2015-11-30 00:09:23 +01:00
Jehan
192f8de165 BuildLangModel: build models with computed frequent characters count. 2015-11-30 00:04:44 +01:00
Jehan
429448199f French language model: fix a start page.
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
b64831ff89 BuildLangModel: allow a list of start pages...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631 BuildLangModel: the SequenceModel naming must include the language name. 2015-11-29 15:49:56 +01:00
Jehan
c59465adfc BuildLangModel: save lang model directly in the right directory. 2015-11-29 13:26:10 +01:00
Jehan
290fbd2e2e BuildLangModel: add the licensing header to generated files. 2015-11-29 02:26:33 +01:00
Jehan
7f290975ba BuildLangModel: map different cases of the same character together.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer characters
(but still in the main alphabet) to enter the frequent character list.
For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
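A minimal sketch of the effect of merging cases when counting character frequencies. The `case_mapping` name comes from the commit message; the function `count_characters` and the way it counts are assumptions, not the script's actual code.

```python
from collections import Counter


def count_characters(text, case_mapping=True):
    """Count alphabetic characters, optionally merging upper and lower case."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue
        if case_mapping:
            lowered = char.lower()
            # Only merge when lowercasing yields a single character.
            if len(lowered) == 1:
                char = lowered
        counts[char] += 1
    return counts


text = 'Œuvre et cœur'
print(count_characters(text)['œ'])                      # → 2
print(count_characters(text, case_mapping=False)['œ'])  # → 1
```

Counted together, the two forms of a rare letter have a better chance of ranking among the frequent characters than either form alone.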
Jehan
00a78faa1d BuildLangModel: the max_depth should be a script option...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424 BuildLangModel: add a --max-page option to limit data size.
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
Jehan
0314f98ece BuildLangModel.py: some in-progress script to build language models. 2015-11-29 01:30:04 +01:00
BYVoid
56a4c0d86c Add authors. 2011-07-13 20:16:23 +08:00