5 Commits

Author SHA1 Message Date
Jehan
0fe51d3851 Issue #21: Greek CP737 support.
It actually breaks "zh:big5", so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, since I haven't really
touched that code since the original implementation by Mozilla.

It might be time to start reviewing these parts of the code.

The test file content comes from the 'Μαρμότα' page on the Greek
Wikipedia (though since 2 letters are missing from this encoding, despite
its popularity for Greek, I had to be careful to choose pieces of text
without such letters).
2022-12-18 22:33:12 +01:00
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a Greek subtitle
badly detected (see commit e0eec3b).
It didn't help, but since I updated it with much more data from
Wikipedia, let's just commit it anyway!
2016-05-25 17:39:10 +02:00
Jehan
198190461e script: move the Wikipedia title syntax cleaning to BuildLangModel.py. 2016-02-21 16:20:22 +01:00
Jehan
d24bd7d578 script: Wikipedia API's python wrapper does not return garbage text anymore.
I can't see any new commits since 2014, so I am assuming the issue was
on Wikipedia's side and that it has been fixed.
2016-02-21 16:07:10 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00