uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-09 02:16:40 +08:00

Author	SHA1	Message	Date
Jehan	4dee1a747d	src, script: fix the order of characters for Vietnamese. Cf. commit 872294d.	2021-03-21 16:02:03 +01:00
Jehan	7439766ece	script, src: regenerate the Vietnamese model. The alphabet was not complete and thus confidence was a bit too low. For instance the VISCII test case's confidence rose from 0.643401 to 0.696346 and the UTF-8 test case's rose from 0.863777 to 0.99. Only the Windows-1258 test case is slightly worse, dropping from 0.532846 to 0.532098. But the overall recognition gain is obvious anyway.	2021-03-21 01:17:55 +01:00
Jehan	5c3a2e8037	src, script: regenerate all existing language models. Now making sure that we have a generic language model working with UTF-8 for all 26 supported models which had single-byte encoding support until now.	2021-03-17 02:07:17 +01:00
Jehan	911695f682	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A downstream project could easily get the English language name from the XML/JSON files provided by the iso-codes project. That project also makes it easy to localize the language name into other languages through gettext (this is what we do in GIMP for instance). I don't add any dependency though and leave it to downstream projects to implement this. I was also wondering if we want to support region information for cases where it would make sense. I especially wondered about it for Chinese encodings as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2021-03-14 00:12:30 +01:00
Jehan	44a50c30ee	Issue #8: no newline at end of file. Not sure if it is in the C++ standard, or was, but apparently some compilers may complain when files don't end with a newline (though neither GCC nor Clang does, as our CI and my local builds are fine). So here are all our generated source files which didn't have such an ending newline (hopefully I forgot none). I just loaded them in my vim editor, and resaved them. This was enough to add an ending newline.	2020-04-22 22:53:25 +02:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python's encode() method apparently does not have any support for it, so I cannot generate the proper statistics data with the current version of the script.	2016-02-13 02:32:57 +01:00
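
The codec limitation mentioned in commit 178c6119b8 can be checked before trying to generate statistics for a charset. The following is only a minimal sketch: the helper name `has_codec` is hypothetical and is not part of uchardet's scripts, and whether a VISCII codec is available depends on the Python installation being used (a stock CPython standard library is assumed not to ship one).

```python
import codecs


def has_codec(name: str) -> bool:
    """Return True if this Python installation knows the codec `name`.

    Hypothetical helper for illustration; not taken from uchardet.
    """
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # codecs.lookup() raises LookupError for unknown encodings.
        return False


if __name__ == "__main__":
    # Charsets discussed in the commit messages above.
    for charset in ("cp1258", "viscii"):
        print(charset, has_codec(charset))
```

A model-generation script could use such a check to skip unsupported charsets instead of failing in the middle of a run.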

7 Commits
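
Commit 911695f682 leaves it to downstream projects to turn the returned ISO 639-1 code into a readable name, and notes that some probers return NULL. A rough sketch of that downstream step follows, under clearly stated assumptions: the `LANGUAGE_NAMES` table is a tiny hand-written stand-in for the iso-codes data, `language_name` is a hypothetical helper, and `None` stands in for the NULL case. None of this is uchardet API.

```python
from typing import Optional

# Hand-written stand-in for the iso-codes data (assumption for this sketch);
# a real project would load the full ISO 639 list from that project's files.
LANGUAGE_NAMES = {
    "vi": "Vietnamese",
    "zh": "Chinese",
    "fr": "French",
}


def language_name(code: Optional[str]) -> str:
    """Map a 2-letter ISO 639-1 code to an English name.

    `None` models a prober that could not report a language.
    Unknown codes are returned unchanged rather than guessed.
    """
    if code is None:
        return "unknown"
    return LANGUAGE_NAMES.get(code.lower(), code)


if __name__ == "__main__":
    for detected in ("vi", "zh", None):
        print(detected, "->", language_name(detected))
```

Returning the raw code for entries missing from the table keeps the output informative without inventing a name.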