7 Commits

Author SHA1 Message Date
Jehan
fe7bf3e994 test: update UTF-16 and UTF-32 tests after label change. 2015-12-04 19:46:51 +01:00
Jehan
5d3fb3dc2f test: add a Windows-1252 French test.
Text from https://fr.wikipedia.org/wiki/Œuf_(cuisine)
2015-12-03 21:20:15 +01:00
Jehan
9dd6b34e93 test: add French UTF-8 test.
Text from:
https://fr.wikipedia.org/wiki/UTF-8
2015-11-30 20:03:33 +01:00
Jehan
04f9309932 tests: update ISO-8859-15 French test file.
The previous technical text about the charsets themselves was not relevant
for identifying a language. In particular, the special characters that differ
between ISO-8859-1 and ISO-8859-15 were used on their own, outside any
character sequence context. Therefore, without language understanding, they
could just as well have represented the ISO-8859-15 letters or the
ISO-8859-1 symbols at the corresponding codepoints.
Replacing with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This uses some of the same characters (in particular 'œ'), but in
contextual character sequences, which makes it relevant for our algorithm.
2015-11-30 00:19:15 +01:00
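To make the ambiguity described in this commit concrete, here is a minimal Python sketch (not part of uchardet; the function name `differing_bytes` is made up for illustration). It lists the byte values that ISO-8859-1 and ISO-8859-15 decode differently, using only Python's built-in codecs.

```python
def differing_bytes():
    """Map each byte value to its (ISO-8859-1, ISO-8859-15) characters
    when the two charsets disagree on that byte."""
    diffs = {}
    for value in range(0x100):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")
        latin9 = raw.decode("iso-8859-15")
        if latin1 != latin9:
            diffs[value] = (latin1, latin9)
    return diffs


if __name__ == "__main__":
    for value, (latin1, latin9) in sorted(differing_bytes().items()):
        print(f"0x{value:02X}: ISO-8859-1 {latin1!r} / ISO-8859-15 {latin9!r}")
```

Seen in isolation, the byte 0xBD is equally valid as '½' or 'œ'. Only a surrounding sequence such as the bytes of "œuf" gives a detector grounds to prefer the ISO-8859-15 reading.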
Jehan
50588ba375 Add an ISO-8859-15 test file for French. 2015-11-28 02:18:57 +01:00
Jehan
7fa0fefef8 Add UTF-16 and UTF-32 test files in French, with BOM.
Unfortunately uchardet currently seems unable to detect UTF-16/32
text without a BOM.
2015-11-26 02:45:00 +01:00
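For readers unfamiliar with BOM-based detection, the following Python sketch shows the general idea. It is an illustration only, not uchardet's implementation; the function name `sniff_bom` and the returned labels are assumptions made for this example. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# The UTF-32 BOMs are checked before the UTF-16 ones because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return a label for the BOM that starts `data`, or None if there is none."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF16_LE + "œuf".encode("utf-16-le")
    without_bom = "œuf".encode("utf-16-le")
    print(sniff_bom(with_bom))
    print(sniff_bom(without_bom))
```

Without the leading BOM, this approach has nothing to go on, which corresponds to the limitation noted in the commit message.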
Jehan
0efcdfa546 Reorganize test files in language subdirectories.
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test
files should take this into account, and we should create several
test files in different languages for each encoding that is used by
several languages.
2015-11-17 21:12:39 +01:00
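As a rough illustration of why the language matters, the Python sketch below computes the relative frequency of non-ASCII characters in a text. The function name `non_ascii_profile` and the two sample sentences are invented for this example and are not taken from the test files.

```python
from collections import Counter


def non_ascii_profile(text):
    """Return the relative frequency of each non-ASCII character in `text`."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample sentences, chosen only to show differing profiles.
    french = "Les œufs sont cuits à la poêle en été."
    german = "Die Größe der Bäume überrascht."
    print(sorted(non_ascii_profile(french)))
    print(sorted(non_ascii_profile(german)))
```

Both samples can be stored in the same single-byte encoding, yet they use disjoint sets of accented characters, so a single test file per encoding would exercise only one such distribution.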