4 Commits

Author SHA1 Message Date
Jehan
41d309e8a2 script, src: regenerate Russian models and add UTF-8/Russian support.
This fixes the broken Russian test in Windows-1251, which once again
gets a much better score with Russian. This also adds UTF-8 support.

Same as Bulgarian, I wonder why I had not regenerated this earlier.

The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.

Note that this now breaks the zh:gb18030 test: the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000). I think I'll have to look a
bit closer at our dedicated GB18030 prober.
2022-12-17 21:41:11 +01:00
Jehan
942ac05ff5 Add some Russian test files.
Texts from:
IBM855: https://ru.wikipedia.org/wiki/CP855
IBM866: https://ru.wikipedia.org/wiki/Альтернативная_кодировка
MAC-CYRILLIC: https://ru.wikipedia.org/wiki/MacCyrillic
2015-11-27 18:17:20 +01:00
Jehan
0d70a36910 Adding some more test files for Russian and Chinese.
Taken from:
https://zh.wikipedia.org/wiki/EUC
https://ru.wikipedia.org/wiki/КОИ-8
Also rename a file (s/utf8.txt/utf-8.txt/) to fix a build test.
2015-11-18 19:27:38 +01:00
Jehan
0efcdfa546 Reorganize test files in language subdirectories.
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account, and we should create several test files
in different languages for encodings used in various languages.
2015-11-17 21:12:39 +01:00