uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-07 17:26:41 +08:00

Author	SHA1	Message	Date
Jehan	5dcff7b241	Hide away tests known to fail. Some charsets are simply not supported (e.g. fr:iso-8859-1), some are temporarily deactivated (e.g. hu:iso-8859-2), and some are wrongly detected as closely related charsets. These were broken (or not efficient) from the start, and there is no need to pollute the `make test` output with them, since that could make us miss actual regressions when they occur. So let's hide them for now until we can improve the situation.	2015-11-18 20:02:58 +01:00
Jehan	4b38e68aa2	CMake tests: separate the lang and charset with a colon... ... rather than a hyphen. It makes the test names easier to read.	2015-11-18 19:42:35 +01:00
Jehan	0d70a36910	Adding some more test files for Russian and Chinese. Taken from: https://zh.wikipedia.org/wiki/EUC https://ru.wikipedia.org/wiki/КОИ-8 And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.	2015-11-18 19:27:38 +01:00
Jehan	eb727d3aca	Add automatic testing against every test file.	2015-11-18 18:18:27 +01:00
Jehan	f303a41735	Add Thai test file for UTF-8. Text from Thai Wikipedia: https://th.wikipedia.org/wiki/ยูนิโคด	2015-11-18 03:26:34 +01:00
Jehan	e7c8114233	Add Hebrew test files. Texts from Hebrew Wikipedia: https://he.wikipedia.org/wiki/עברית https://he.wikipedia.org/wiki/ISO_8859 https://he.wikipedia.org/wiki/UTF-8 uchardet fails to detect the ISO-8859-8 files and detects them as Windows-1255, which is probably acceptable since it is apparently an "almost compatible superset". It may be worth trying to make more complete test files in the future to demonstrate the differences.	2015-11-18 03:16:18 +01:00
Jehan	601e59bd83	Add Greek test files. Taken from Greek Wikipedia: https://el.wikipedia.org/wiki/UTF-8 https://el.wikipedia.org/wiki/ISO_8859-7 https://el.wikipedia.org/wiki/ISO_8859-7#Windows-1253 The Windows-1253 test fails and returns "ISO-8859-7". The two charsets are actually fairly close for the main letters, except for Ά, which makes them difficult to differentiate.	2015-11-18 02:57:09 +01:00
Jehan	c8532f63a8	Adding UTF-8 file for Korean. Text taken from Korean Wikipedia: https://ko.wikipedia.org/wiki/UTF-8	2015-11-18 02:36:33 +01:00
Jehan	a76c0786b3	Adding test files for the main Japanese encodings... ... taken from the following Japanese Wikipedia pages: https://ja.wikipedia.org/wiki/Extended_Unix_Code https://ja.wikipedia.org/wiki/ISO/IEC_2022 https://ja.wikipedia.org/wiki/UTF-8	2015-11-17 21:24:47 +01:00
Jehan	0efcdfa546	Reorganize test files in language subdirectories. I realize that the language a text is written in is very important, since it completely changes the character distribution. Our test files should take this into account, and we should create several test files in different languages for each encoding used in various languages.	2015-11-17 21:12:39 +01:00
Jehan	192b0e7d51	Add test files for ISO-8859-[12]. Taken from French page about ISO-8859-1: https://fr.wikipedia.org/wiki/ISO_8859-1 ... and Hungarian Wikipedia page about ISO-8859-2: https://hu.wikipedia.org/wiki/ISO/IEC_8859-2 We don't have support for ISO-8859-1, and both these files are detected as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is disabled because the ISO-8859-1 file would be detected as ISO-8859-2, which would in turn be a clear error.	2015-11-17 19:39:58 +01:00
Jehan	3f3f4b8011	Add an ISO-8859-5 test file. Text taken from Russian Wikipedia page about ISO-8859-5: https://ru.wikipedia.org/wiki/ISO_8859-5	2015-11-17 19:11:59 +01:00
Jehan	bafccfcea8	Add Windows-1251 test files. Texts taken from Bulgarian Wikipedia page about Windows-1251: https://bg.wikipedia.org/wiki/Windows-1251 ... and Russian Wikipedia page about Windows-1251: https://ru.wikipedia.org/wiki/Windows-1251 The Bulgarian file detection is right, but the Russian detection returns "MAC-CYRILLIC", which is an error and should be fixed.	2015-11-17 19:09:37 +01:00
Jehan	8216f7b395	Add an ISO-2022-KR test file. Text taken from Korean Wikipedia page about the ISO-2022-KR: https://ko.wikipedia.org/wiki/ISO/IEC_2022	2015-11-17 18:23:46 +01:00
Jehan	9172b763d1	Add TIS-620 in README (Thai language) and a test file. Test text based on Thai Wikipedia page about the TIS-620 encoding: https://th.wikipedia.org/wiki/TIS-620	2015-11-17 17:39:45 +01:00
Jehan	362e36d1ed	Add EUC-KR test file. Contains text taken from Wikipedia on EUC-KR page in Korean. https://ko.wikipedia.org/wiki/EUC-KR I added it as a pseudo-subtitle file because as the original Mozilla paper says: "The input text may contain extraneous noises which have no relation to its encoding, e.g. HTML tags, non-native words". Therefore I feel it is important to have test files that are a little noisy if possible, in order to test our algorithm's resistance to noise.	2015-11-17 16:36:17 +01:00
byvoid	eaab1d7868	Set permissions.	2011-07-11 18:08:26 +08:00
BYVoid	86b4739e5a	Add test cases.	2011-07-11 14:57:31 +08:00


68 Commits