uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-07 18:26:51 +08:00

Author	SHA1	Message	Date
Jehan	fe7bf3e994	test: update UTF-16 and UTF-32 tests after label change.	2015-12-04 19:46:51 +01:00
Jehan	5d3fb3dc2f	test: add a Windows-1252 French test. Text from https://fr.wikipedia.org/wiki/Œuf_(cuisine)	2015-12-03 21:20:15 +01:00
Jehan	9dd6b34e93	test: add French UTF-8 test. Text from: https://fr.wikipedia.org/wiki/UTF-8	2015-11-30 20:03:33 +01:00
Jehan	04f9309932	tests: update ISO-8859-15 French test file. The previous technical text about the charsets themselves was not relevant for identifying a language. In particular, the special characters that differ between ISO-8859-1 and ISO-8859-15 were used by themselves, outside any character-sequence context. Therefore, without language understanding, they could just as well have represented the ISO-8859-15 letters or the ISO-8859-1 symbols at the corresponding codepoints. Replacing with text from this Wikipedia page: https://fr.wikipedia.org/wiki/Œuf_(cuisine) This uses some of these same characters (in particular 'œ') but in contextual character sequences, making it relevant for our algorithm.	2015-11-30 00:19:15 +01:00
Jehan	50588ba375	Add an ISO-8859-15 test file for French.	2015-11-28 02:18:57 +01:00
Jehan	7fa0fefef8	Add UTF-16 and UTF-32 test files in French, with BOM. Unfortunately uchardet currently seems unable to detect UTF-16/32 text without a BOM.	2015-11-26 02:45:00 +01:00
Jehan	0efcdfa546	Reorganize test files in language subdirectories. I realize that the language a text is written in is very important, since it completely changes the character distribution. Our test files should take this into account, and we should create several test files in different languages for encodings used in various languages.	2015-11-17 21:12:39 +01:00

7 Commits
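
Commit 7fa0fefef8 notes that the UTF-16 and UTF-32 test files carry a BOM because detection without one did not seem to work at the time. As a rough illustration of what BOM-based identification involves, here is a minimal Python sketch. It is not uchardet's implementation: the function name `sniff_bom` and the returned label strings are hypothetical, chosen only for this example. It relies on the BOM constants in Python's standard `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the 4-byte marks are tested first.
# The label strings are illustrative, not uchardet's actual labels.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

With no BOM the function returns None, which mirrors the situation the commit describes: without the mark, a detector has to fall back on the statistics of the byte stream itself, and those depend on the language of the text.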