14 Commits

Author SHA1 Message Date
Ilya Tumaykin
29f18210b1 cmake: hardcode less
2016-03-22 01:23:04 +03:00
Ilya Tumaykin
b44be77be6 cmake: uniform indent everywhere
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
The test for ISO-8859-1 is disabled for now: for the characters used in
Danish, ISO-8859-1 and ISO-8859-15 do not differ enough, so whichever
charset is declared first "wins".
This should be improved later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
1b4c62ac21 tests: test files for Spanish.
I disable only the ISO-8859-15 test, since ISO-8859-15 encodes all
Spanish letters the same way as ISO-8859-1. Unfortunately the illegal
codepoints are the same too.
The two could likely be told apart by symbols (like the euro symbol),
but our current algorithm does not use these when comparing charsets.
Text from https://es.wikipedia.org/wiki/España
2015-12-12 18:55:43 +01:00
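To illustrate the point made in the commit above, here is a minimal Python sketch. It relies only on Python's built-in codecs, not on uchardet itself, and the helper name `differing_bytes` is hypothetical. It lists the byte values that ISO-8859-1 and ISO-8859-15 decode differently, and checks that Spanish letters and punctuation encode identically in both.

```python
# Sketch only: uses Python's standard codecs, not uchardet's language models.

def differing_bytes(a="iso-8859-1", b="iso-8859-15"):
    """Return the byte values that the two charsets decode to different characters."""
    return [
        byte for byte in range(256)
        if bytes([byte]).decode(a) != bytes([byte]).decode(b)
    ]

# Letters and punctuation specific to Spanish.
SPANISH = "áéíóúüñÁÉÍÓÚÜÑ¿¡"

DIFFERING = differing_bytes()
SPANISH_IDENTICAL = SPANISH.encode("iso-8859-1") == SPANISH.encode("iso-8859-15")

# The euro sign exists only in ISO-8859-15, where it replaces the
# generic currency sign at 0xA4.
EURO_BYTE = "€".encode("iso-8859-15")

print([hex(b) for b in DIFFERING])
print(SPANISH_IDENTICAL, EURO_BYTE)
```

Only eight byte values differ, and none of them is a Spanish letter, so a detector that scores letter frequencies alone has nothing to separate the two charsets with.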
Jehan
2bade77bf9 tests: update Windows-1250 test file for Hungarian.
ISO-8859-2 and Windows-1250 are identical for all letters in the
Hungarian alphabet. So for most texts, it is not an error to return
one charset or the other.
What can make the difference is, for instance, that Windows-1250 has
some symbols where ISO-8859-2 has control characters: quotes,
dashes, the euro symbol…
Since control characters now have a negative impact on confidence,
texts with such symbols tend towards a Windows-1250 decision.
The new test file has such quote symbols.
2015-12-12 18:12:08 +01:00
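The reasoning in the commit above can be checked with a short Python sketch. It uses Python's built-in `cp1250` and `iso-8859-2` codecs as stand-ins and does not reproduce uchardet's confidence computation; the helper names are hypothetical.

```python
import unicodedata

# Sketch only: Python's standard codecs, not uchardet's detection logic.

# Accented letters of the Hungarian alphabet.
HUNGARIAN_LETTERS = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"

def same_bytes(text, a="cp1250", b="iso-8859-2"):
    """True if both charsets encode the text to the same byte sequence."""
    return text.encode(a) == text.encode(b)

def controls_when_misread(text, written_as="cp1250", read_as="iso-8859-2"):
    """Control characters that appear when the bytes are decoded with the wrong charset."""
    decoded = text.encode(written_as).decode(read_as)
    return [ch for ch in decoded if unicodedata.category(ch) == "Cc"]

LETTERS_IDENTICAL = same_bytes(HUNGARIAN_LETTERS)

# Hungarian-style quotes live in 0x80-0x9F in Windows-1250; ISO-8859-2
# assigns that range to C1 control characters.
CONTROLS = controls_when_misread("„Magyar nyelv”")

print(LETTERS_IDENTICAL, [hex(ord(ch)) for ch in CONTROLS])
```

The letters alone give no signal, while the two quote marks turn into control characters under ISO-8859-2, which is the kind of evidence the commit says now lowers confidence.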
Jehan
15afc5c593 test: add a Hungarian Windows-1250 test but skip it for now.
Text from: https://hu.wikipedia.org/wiki/Magyar_nyelv
2015-12-03 21:18:55 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
f4f9fc3f28 test: reenable Windows-1251 test for Russian.
Commit 4f1c3ff actually fixed it!
2015-12-02 21:53:27 +01:00
Jehan
a8e9de307b Add UTF-16 test files without BOM...
... and disable these tests for now, since uchardet is not yet able
to detect UTF-16 without a BOM.
2015-11-28 19:50:18 +01:00
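For context on the commit above, here is a minimal Python sketch of BOM-based detection. It is not uchardet's implementation: the function name `utf16_from_bom` is hypothetical, and the sketch ignores UTF-32, whose little-endian BOM starts with the same two bytes.

```python
import codecs

# Sketch only: a simplified BOM check, not uchardet's code.

def utf16_from_bom(data: bytes):
    """Return the UTF-16 variant announced by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    return None

payload = "uchardet".encode("utf-16-le")  # no BOM is written by this codec

WITH_BOM = utf16_from_bom(codecs.BOM_UTF16_LE + payload)
WITHOUT_BOM = utf16_from_bom(payload)

print(WITH_BOM, WITHOUT_BOM)
```

Without the BOM, a detector has to infer the encoding from the content itself, which is what the disabled tests are waiting on.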
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
5dcff7b241 Hide away tests known to fail.
Some charsets are simply not supported (e.g. fr:iso-8859-1), some are
temporarily deactivated (e.g. hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or not efficient) from the start, and there is no
need to pollute the `make test` output with them, which may make us
miss actual regressions when they occur. So let's hide these away for
now until we can improve the situation.
2015-11-18 20:02:58 +01:00
Jehan
4b38e68aa2 CMake tests: separate the lang and charset with colon...
... rather than a hyphen. It makes it easier to read.
2015-11-18 19:42:35 +01:00
Jehan
eb727d3aca Add automatic testing against every test file.
2015-11-18 18:18:27 +01:00