The gb18030 test fails: the sample text is reported as Macedonian
encoded with windows-1251. There are two reasons: (1) the Macedonian
language model is very optimistic and reports high confidence on the
given sample, and (2) the original sample text is extremely short and
lacks linguistic variety.
Simply adding a good amount of real Chinese text to the sample file
makes the test pass.
The added text was extracted from Wikipedia:
https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account: for each encoding that is used by
several languages, we should create test files in different languages.