uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-08 09:56:41 +08:00

Author	SHA1	Message	Date
Jehan	4dee1a747d	src, script: fix the order of characters for Vietnamese. Cf. commit 872294d.	2021-03-21 16:02:03 +01:00
Jehan	7439766ece	script, src: regenerate the Vietnamese model. The alphabet was not complete and thus confidence was a bit too low. For instance the VISCII test case's confidence bumped from 0.643401 to 0.696346 and the UTF-8 test case bumped from 0.863777 to 0.99. Only the Windows-1258 test case is slightly worse from 0.532846 to 0.532098. But the overall recognition gain is obvious anyway.	2021-03-21 01:17:55 +01:00
Jehan	5c3a2e8037	src, script: regenerate all existing language models. Now making sure that we have a generic language model working with UTF-8 for all 26 supported models which had single-byte encoding support until now.	2021-03-17 02:07:17 +01:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
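
Note on commit 178c6119b8: the message says Python's encode() apparently had no VISCII support, while Windows-1258 could be handled. A minimal sketch of how one might check this on a given Python installation follows. The helper name codec_available is hypothetical and is not part of uchardet or its scripts; it only asks the standard codecs registry whether an encoding name resolves. The result for "viscii" depends on the Python build and on any third-party codecs that are installed, so no output is asserted for it here.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if Python's codec registry can resolve this encoding name.

    Hypothetical helper for illustration; not taken from the uchardet scripts.
    """
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names.
        return False


if __name__ == "__main__":
    # "cp1258" is the standard library's name for Windows-1258.
    for name in ("cp1258", "viscii"):
        print(name, codec_available(name))
```

If an encoding is reported as unavailable, str.encode() with that name raises LookupError, which matches the limitation the commit message describes.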