68 Commits

Author SHA1 Message Date
Jehan
5dcff7b241 Hide away tests known to fail.
Some charsets are simply not supported (ex: fr:iso-8859-1), some are
temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or inefficient) from the start, and there is no
need to pollute the `make test` output with them, since the noise may
make us miss actual regressions when they occur. So let's hide them for
now, until we can improve the situation.
2015-11-18 20:02:58 +01:00
Jehan
4b38e68aa2 CMake tests: separate the lang and charset with a colon...
... rather than a hyphen. It makes it easier to read.
2015-11-18 19:42:35 +01:00
Jehan
0d70a36910 Adding some more test files for Russian and Chinese.
Taken from:
https://zh.wikipedia.org/wiki/EUC
https://ru.wikipedia.org/wiki/КОИ-8
Also rename a file (s/utf8.txt/utf-8.txt/) to fix a build test.
2015-11-18 19:27:38 +01:00
Jehan
eb727d3aca Add automatic testing against every test file.
2015-11-18 18:18:27 +01:00
Jehan
f303a41735 Add Thai test file for UTF-8.
Text from Thai Wikipedia:
https://th.wikipedia.org/wiki/ยูนิโคด
2015-11-18 03:26:34 +01:00
Jehan
e7c8114233 Add Hebrew test files.
Texts from Hebrew Wikipedia:
https://he.wikipedia.org/wiki/עברית
https://he.wikipedia.org/wiki/ISO_8859
https://he.wikipedia.org/wiki/UTF-8
uchardet fails to detect the ISO-8859-8 files and detects them as
Windows-1255, which is probably acceptable since it is apparently
an "almost compatible superset". It may be worth trying to make
more complete test files in the future to demonstrate the differences.
2015-11-18 03:16:18 +01:00
Jehan
601e59bd83 Add Greek test files.
Taken from Greek Wikipedia:
https://el.wikipedia.org/wiki/UTF-8
https://el.wikipedia.org/wiki/ISO_8859-7
https://el.wikipedia.org/wiki/ISO_8859-7#Windows-1253
The Windows-1253 test fails and returns "ISO-8859-7". The two charsets
are actually fairly close for the main letters, except for Ά, which
makes them difficult to differentiate.
2015-11-18 02:57:09 +01:00
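The closeness of the two Greek charsets can be checked with Python's built-in codecs. This is only an illustrative sketch using the standard `cp1253` and `iso8859_7` codecs, not uchardet's own tables; the sample string is an arbitrary choice.

```python
# Ά (GREEK CAPITAL LETTER ALPHA WITH TONOS) sits at a different byte
# in each charset.
letter = "Ά"
cp1253_byte = letter.encode("cp1253")
iso_byte = letter.encode("iso8859_7")

# Plain Greek letters share the same byte values in both charsets, so a
# text without Ά gives a detector little to distinguish them by.
common = "αβγδ ΑΒΓΔ"
same_common = common.encode("cp1253") == common.encode("iso8859_7")

print(cp1253_byte.hex(), iso_byte.hex(), same_common)
```

A test file that contains Ά (and other bytes where the two tables diverge) gives a detector more to work with than one limited to the shared letters.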
Jehan
c8532f63a8 Adding UTF-8 file for Korean.
Text taken from Korean Wikipedia:
https://ko.wikipedia.org/wiki/UTF-8
2015-11-18 02:36:33 +01:00
Jehan
a76c0786b3 Adding test files for the main Japanese encodings...
... taken from the following Japanese Wikipedia pages:
https://ja.wikipedia.org/wiki/Extended_Unix_Code
https://ja.wikipedia.org/wiki/ISO/IEC_2022
https://ja.wikipedia.org/wiki/UTF-8
2015-11-17 21:24:47 +01:00
Jehan
0efcdfa546 Reorganize test files in language subdirectories.
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account, and we should create several test files
in different languages for encodings used by several languages.
2015-11-17 21:12:39 +01:00
Jehan
192b0e7d51 Add test files for ISO-8859-[12].
Taken from French page about ISO-8859-1:
https://fr.wikipedia.org/wiki/ISO_8859-1
... and Hungarian Wikipedia page about ISO-8859-2:
https://hu.wikipedia.org/wiki/ISO/IEC_8859-2
We don't have support for ISO-8859-1, and both these files are detected
as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since
Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is
disabled because the ISO-8859-1 file would be detected as ISO-8859-2,
which would in turn be a clear error.
2015-11-17 19:39:58 +01:00
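The "superset" remark can be illustrated with Python's standard codecs. This is a sketch, not uchardet code, and the French sample sentence is made up for the example: every printable ISO-8859-1 byte decodes to the same character under Windows-1252, and the two differ only in the 0x80-0x9F range.

```python
# A made-up French sample; any ISO-8859-1 text behaves the same way.
text = "Le café coûte 3 francs à Noël"
raw = text.encode("iso-8859-1")
roundtrip_ok = raw.decode("cp1252") == text

# The whole 0xA0-0xFF range is identical in both charsets.
high = bytes(range(0xA0, 0x100))
same_high = high.decode("iso-8859-1") == high.decode("cp1252")

# The difference is in 0x80-0x9F: C1 control codes in ISO-8859-1,
# mostly printable characters (such as the euro sign) in Windows-1252.
euro_cp1252 = b"\x80".decode("cp1252")
c1_latin1 = b"\x80".decode("iso-8859-1")
```

This is why reporting "WINDOWS-1252" for an ISO-8859-1 file still decodes the text correctly.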
Jehan
3f3f4b8011 Add an ISO-8859-5 test file.
Text taken from Russian Wikipedia page about ISO-8859-5:
https://ru.wikipedia.org/wiki/ISO_8859-5
2015-11-17 19:11:59 +01:00
Jehan
bafccfcea8 Add Windows-1251 test files.
Texts taken from Bulgarian Wikipedia page about Windows-1251:
https://bg.wikipedia.org/wiki/Windows-1251
... and Russian Wikipedia page about Windows-1251:
https://ru.wikipedia.org/wiki/Windows-1251
The Bulgarian file detection is right, but the Russian detection
returns "MAC-CYRILLIC", which is an error and should be fixed.
2015-11-17 19:09:37 +01:00
Jehan
8216f7b395 Add an ISO-2022-KR test file.
Text taken from the Korean Wikipedia page about ISO-2022-KR:
https://ko.wikipedia.org/wiki/ISO/IEC_2022
2015-11-17 18:23:46 +01:00
Jehan
9172b763d1 Add TIS-620 in README (Thai language) and a test file.
Test text based on Thai Wikipedia page about the TIS-620 encoding:
https://th.wikipedia.org/wiki/TIS-620
2015-11-17 17:39:45 +01:00
Jehan
362e36d1ed Add EUC-KR test file.
Contains text taken from the Korean Wikipedia page on EUC-KR:
https://ko.wikipedia.org/wiki/EUC-KR
I added it as a pseudo-subtitle file because, as the original Mozilla
paper says: "The input text may contain extraneous noises which have no
relation to its encoding, e.g. HTML tags, non-native words".
Therefore I feel it is important to have slightly noisy test files if
possible, in order to test our algorithm's resistance to noise.
2015-11-17 16:36:17 +01:00
byvoid
eaab1d7868 Set permissions.
2011-07-11 18:08:26 +08:00
BYVoid
86b4739e5a Add test cases.
2011-07-11 14:57:31 +08:00