Jehan
923d264470
LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
...
The test for ISO-8859-1 is disabled for now because, for the characters
used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not
big enough. Therefore the first one to be declared "wins".
Let's see how to improve this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
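To illustrate why ISO-8859-1 and ISO-8859-15 are hard to tell apart for Danish (commit 923d264470), here is a minimal sketch that uses only Python's built-in codecs. The function names and the `DANISH_LETTERS` constant are hypothetical and are not taken from the repository.

```python
def differing_bytes(first="iso8859_1", second="iso8859_15"):
    """Return the byte values that decode to different characters."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(first) != bytes([value]).decode(second)
    ]


# Hypothetical sample: the Danish letters outside the ASCII range.
DANISH_LETTERS = "æøåÆØÅ"


def encodes_identically(text, first="iso8859_1", second="iso8859_15"):
    """True when both charsets produce the same bytes for this text."""
    return text.encode(first) == text.encode(second)


if __name__ == "__main__":
    print([hex(value) for value in differing_bytes()])
    print(encodes_identically(DANISH_LETTERS))
```

Only a handful of byte positions differ between the two charsets, and none of them holds a Danish letter, so a Danish text without a euro sign gives a detector nothing to separate them.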
Jehan
98b5e52252
LangModels: add VISCII encoding support and retrain Vietnamese model.
2016-02-13 03:51:18 +01:00
Jehan
600cf76a76
BuildLangModel: try using iconv for conversion when support missing...
...
... in Python. For instance, I had the case where the VISCII encoding is
supported by iconv but not by the encode()/decode() functions in core Python.
2016-02-13 03:47:41 +01:00
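The iconv fallback described in commit 600cf76a76 could look roughly like the sketch below. It assumes an `iconv` executable is available on the PATH; the function name `encode_text` is hypothetical and this is not the actual code of the script.

```python
import subprocess


def encode_text(text, charset):
    """Encode text with Python's codecs, falling back to iconv.

    Assumption: an `iconv` command-line tool is installed. It is only
    invoked when Python itself does not know the charset.
    """
    try:
        return text.encode(charset)
    except LookupError:
        # Python has no codec of that name: delegate to iconv.
        result = subprocess.run(
            ["iconv", "-f", "UTF-8", "-t", charset],
            input=text.encode("UTF-8"),
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout


if __name__ == "__main__":
    print(encode_text("é", "ISO-8859-1"))
```

Trying the built-in codec first keeps the common case fast and free of external dependencies; the subprocess is only spawned for charsets Python does not know.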
Jehan
178c6119b8
LangModels: add Windows-1258 support for Vietnamese.
...
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate
the proper statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
Jehan
27135a8880
BuildLangModel: printing a message when discarding a page.
2016-02-13 02:27:15 +01:00
Jehan
9c3c37517c
LangModels: add Arabic support.
...
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2
LangModels: retraining Greek models with my training script.
...
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712
LangModels: adding Spanish support.
...
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
055332ac7d
BuildLangModel: allow the alphabet list to be written in string format.
2015-12-12 18:50:29 +01:00
Jehan
6b2722885a
BuildLangModel: forgot to add charset/language files.
2015-12-12 18:18:08 +01:00
Jehan
7b4eb9827e
BuildLangModel: add an exception handler on charset spec errors.
2015-12-12 18:00:30 +01:00
Jehan
569509f844
BuildLangModel: forgot to add logs for Thai models generation.
2015-12-04 03:26:52 +01:00
Jehan
fb3c47a073
LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
...
ISO-8859-11 is identical to TIS-620, except for the added
non-breaking space character.
Our detection will therefore always return TIS-620, except in the
exceptional cases where a text contains a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
ffcd85f709
script: forgot to commit ISO-8859-9 and Turkish files.
2015-12-04 02:40:54 +01:00
Jehan
5ee1c3ee39
LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.
2015-12-04 02:35:09 +01:00
Jehan
22b9ed2d4f
BuildLangModel: add concept of custom_case_mapping…
...
… for langs for which Python's lower() algorithm fails.
In particular, the Turkish dotted/dotless 'i' does not follow the same rules
as in common Western languages.
The lowercase of 'I' is not 'i' but 'ı'.
The uppercase of 'i' is not 'I' but 'İ'.
2015-12-04 02:29:40 +01:00
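As an illustration of the custom_case_mapping idea from commit 22b9ed2d4f, the sketch below applies a per-language override before falling back to Python's `lower()`. The dictionary layout, the `TURKISH_CASE_MAPPING` constant and the function name are assumptions made for this example, not the script's actual data structure.

```python
# Hypothetical layout: uppercase character -> language-specific lowercase.
TURKISH_CASE_MAPPING = {
    "I": "ı",  # dotless capital I lowercases to dotless ı
    "İ": "i",  # dotted capital İ lowercases to dotted i
}


def lower_with_mapping(char, custom_case_mapping=None):
    """Lowercase one character, honoring a language-specific override."""
    if custom_case_mapping and char in custom_case_mapping:
        return custom_case_mapping[char]
    return char.lower()


if __name__ == "__main__":
    print("I".lower())                                     # generic rule
    print(lower_with_mapping("I", TURKISH_CASE_MAPPING))   # Turkish rule
```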
Jehan
f0e122b506
LangModels: add Esperanto ISO-8859-3 language model.
2015-12-04 01:35:56 +01:00
Jehan
a167bd5e42
BuildLangModel: lowercase only when resulting char has a composed form.
...
I had the case with the Turkish dotted 'İ' where lowercasing it with
Python's algorithm returned a decomposed character that it was not able
to recompose. Therefore ord() raised a TypeError because the string
length was 2.
2015-12-04 01:30:21 +01:00
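The problem fixed in commit a167bd5e42 can be reproduced with the standard library alone. The sketch below lowercases a character only when the result recomposes to a single code point; the function name `safe_lower` is hypothetical, and the script may implement the check differently.

```python
import unicodedata


def safe_lower(char):
    """Lowercase a character only if the result is a single code point.

    Otherwise the original character is returned unchanged, so that
    ord() can still be called on the result.
    """
    lowered = unicodedata.normalize("NFC", char.lower())
    return lowered if len(lowered) == 1 else char


if __name__ == "__main__":
    # 'İ'.lower() is 'i' followed by U+0307 COMBINING DOT ABOVE.
    print(len("İ".lower()))
    print(safe_lower("İ"), safe_lower("É"))
```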
Jehan
aa587a64bd
LangModels: adding German models for ISO-8859-1 and Windows-1252.
2015-12-03 23:58:41 +01:00
Jehan
0270b1e856
Adding French Windows-1252 support.
2015-12-03 21:22:30 +01:00
Jehan
9cb5764b73
LangModels: update the French language models.
...
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dc5caa46bc
BuildLangModel: fix hardcoded file names.
2015-11-30 19:18:25 +01:00
Jehan
3e5d37a6b5
BuildLangModel: process pages level per level.
...
I.e. horizontally or "breadth first" rather than vertical tree traversal.
This makes sure that all the start pages in particular are searched
when using the max_page option.
2015-11-30 19:12:04 +01:00
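The level-per-level traversal from commit 3e5d37a6b5 can be sketched on a toy link graph as follows. The `links` dictionary stands in for the real Wikipedia queries; the function name, the parameters and the demo data are hypothetical.

```python
def crawl_breadth_first(start_pages, links, max_depth=2, max_page=None):
    """Visit pages one level at a time, starting with all start pages.

    `links` maps a page title to the titles it links to (a stand-in for
    real page fetching). Returns the visited titles in visiting order.
    """
    visited = []
    seen = set()
    level = list(start_pages)
    depth = 0
    while level and depth <= max_depth:
        next_level = []
        for page in level:
            if page in seen:
                continue
            if max_page is not None and len(visited) >= max_page:
                return visited
            seen.add(page)
            visited.append(page)
            # Children are queued for the next level, not visited now.
            next_level.extend(links.get(page, []))
        level = next_level
        depth += 1
    return visited


# Hypothetical demo graph with two start pages.
DEMO_LINKS = {"A": ["A1", "A2"], "A1": ["A11"], "B": ["B1"]}

if __name__ == "__main__":
    print(crawl_breadth_first(["A", "B"], DEMO_LINKS, max_page=2))
```

With a page limit of 2, a depth-first traversal would visit "A" and then "A1" and never reach the second start page, whereas the level-per-level order visits both start pages first.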
Jehan
d9d347099e
BuildLangModel: fix some minor comment from a previous spec.
2015-11-30 00:09:23 +01:00
Jehan
192f8de165
BuildLangModel: build models with computed frequent characters count.
2015-11-30 00:04:44 +01:00
Jehan
429448199f
French language model: fix a start page.
...
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
b64831ff89
BuildLangModel: allow a list of start pages...
...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631
BuildLangModel: the SequenceModel naming must include the language name.
2015-11-29 15:49:56 +01:00
Jehan
c59465adfc
BuildLangModel: save lang model directly in the right directory.
2015-11-29 13:26:10 +01:00
Jehan
290fbd2e2e
BuildLangModel: add the licensing header to generated files.
2015-11-29 02:26:33 +01:00
Jehan
7f290975ba
BuildLangModel: map different cases of the same character together.
...
With the new case_mapping lang property, we can consider the upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer
characters (but still in the main alphabet) to enter the frequent
character list. For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
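To show the effect of the case_mapping property from commit 7f290975ba, the sketch below counts letter frequencies with and without merging cases. The function name and its boolean parameter are assumptions; the real script works on character orders for a target charset rather than on a plain counter.

```python
from collections import Counter


def count_letters(text, case_mapping=False):
    """Count alphabetic characters, optionally merging upper/lower case."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue
        counts[char.lower() if case_mapping else char] += 1
    return counts


if __name__ == "__main__":
    sample = "Œuf, œil"
    print(count_letters(sample)["œ"])
    print(count_letters(sample, case_mapping=True)["œ"])
```

Merging both cases pools their counts, which gives a rare letter such as 'œ' a better chance of reaching the frequent character list.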
Jehan
00a78faa1d
BuildLangModel: the max_depth should be a script option...
...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424
BuildLangModel: add a --max-page option to limit data size.
...
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
Jehan
0314f98ece
BuildLangModel.py: some in-progress script to build language models.
2015-11-29 01:30:04 +01:00
BYVoid
56a4c0d86c
Add authors.
2011-07-13 20:16:23 +08:00