Mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git (synced 2025-12-06 16:56:40 +08:00)

The gb18030 test fails, reporting the sample text as Macedonian encoded in windows-1251. There are two causes: (1) the Macedonian language model is very optimistic and reports high confidence for the given sample, and (2) the original sample text is extremely short and lacks linguistic variety. With a good amount of real Chinese text added to the sample file, the test no longer fails. This text was extracted from Wikipedia: https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD

| File | | |
|---|---|---|
| big5.txt | ||
| euc-tw.txt | ||
| gb18030.txt | ||
| utf-8.txt | ||
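
The misdetection described in the commit message above is easier to see with a concrete case. The following is a minimal Python sketch, not uchardet's actual detection logic: it only shows that a very short GB18030 sample is also a structurally valid windows-1251 byte sequence, so a detector has to rely entirely on its language models to tell the two apart. The helper `decodes_as` is a hypothetical name introduced for this illustration, and `cp1251` is Python's codec name for windows-1251.

```python
# Illustrative sketch only: this is NOT uchardet's detection logic.
# `decodes_as` is a hypothetical helper introduced for this example.

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a structurally valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Two Chinese characters encoded as GB18030 (four bytes in total).
sample = "中文".encode("gb18030")

# The same bytes are also valid windows-1251 (Python codec name: cp1251),
# where they read as Cyrillic letters.
as_cyrillic = sample.decode("cp1251")

print(sample.hex())                   # raw bytes of the sample
print(decodes_as(sample, "gb18030"))  # structurally valid as GB18030
print(decodes_as(sample, "cp1251"))   # structurally valid as windows-1251 too
print(decodes_as(sample, "utf-8"))    # not well-formed UTF-8
print(as_cyrillic)                    # the Cyrillic reading of the same bytes
```

Because byte-level validity cannot separate the two candidates here, a longer and more varied sample gives the language models far more evidence, which is consistent with the fix described in the commit message.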