Jehan
d9d347099e
BuildLangModel: fix some minor comment from a previous spec.
2015-11-30 00:09:23 +01:00
Jehan
192f8de165
BuildLangModel: build models with computed frequent characters count.
2015-11-30 00:04:44 +01:00
Jehan
b64831ff89
BuildLangModel: allow a list of start pages...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631
BuildLangModel: the SequenceModel naming must include the language name.
2015-11-29 15:49:56 +01:00
Jehan
c59465adfc
BuildLangModel: save lang model directly in the right directory.
2015-11-29 13:26:10 +01:00
Jehan
290fbd2e2e
BuildLangModel: add the licensing header to generated files.
2015-11-29 02:26:33 +01:00
Jehan
7f290975ba
BuildLangModel: map different cases of the same character together.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and would allow some rarer
characters (still in the main alphabet) to enter the frequent
character list. For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
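The case mapping described in commit 7f290975ba can be illustrated with a minimal sketch. This is not the code from BuildLangModel.py: the function name count_characters and its case_mapping parameter are hypothetical, chosen only to mirror the lang property named in the commit message. It shows how folding case lets 'œ' and 'Œ' accumulate into a single count.

```python
from collections import Counter


def count_characters(text, case_mapping=True):
    """Count alphabetic characters in text.

    Hypothetical helper: when case_mapping is true, upper and lower
    case versions of a character are counted as one character.
    """
    counts = Counter()
    for char in text:
        if not char.isalpha():
            # Skip punctuation, digits and whitespace.
            continue
        if case_mapping:
            # Fold to lower case so 'Œ' and 'œ' share one entry.
            char = char.lower()
        counts[char] += 1
    return counts


sample = "Œuvre, cœur, Œil"
merged = count_characters(sample, case_mapping=True)
separate = count_characters(sample, case_mapping=False)
```

With case mapping enabled, the three occurrences in the sample land on a single 'œ' entry. Without it they are split across two entries, which makes each one less likely to reach a frequent character list.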
Jehan
00a78faa1d
BuildLangModel: the max_depth should be a script option...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424
BuildLangModel: add a --max-page option to limit data size.
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
Jehan
0314f98ece
BuildLangModel.py: some in-progress script to build language models.
2015-11-29 01:30:04 +01:00