Right now, each time we add support for a new language or a new
charset, there are too many pieces of code that we must not forget to
edit. The script script/BuildLangModel.py will now take care of the
main parts: listing the sequence models, listing the generic language
models and computing the numbers for each listing.
Furthermore, the script will now end with a TODO list of the parts
which still have to be done manually (2 functions to edit and a
CMakeLists).
Finally the script now accepts a list of languages to edit rather than
having to be run for one language at a time. It also accepts 2 special
codes: "none", which retrains none of the languages and only
regenerates the generated listings; and "all", which retrains all
models (useful in particular when we change the model formats or usage
and want to regenerate everything).
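Just as an illustration of the intent (this is not the actual
BuildLangModel.py code, and the names are made up), the special codes
could be interpreted along these lines:

```python
import argparse

# Hypothetical sketch of how the language arguments could be
# interpreted; only the behaviour described above is real.
def select_languages(requested, all_languages):
    """Return the list of language codes whose models must be retrained."""
    if requested == ["none"]:
        # Retrain nothing; only the generated listings are rewritten.
        return []
    if requested == ["all"]:
        # Retrain every known model (e.g. after a model format change).
        return list(all_languages)
    return requested

parser = argparse.ArgumentParser()
parser.add_argument("languages", nargs="+",
                    help="language codes, or the special codes 'none'/'all'")
args = parser.parse_args()
to_retrain = select_languages(args.languages, ["da", "de", "fr", "no"])
```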
Adding a generic language model (see coming commit), which uses the
same data as the specific single-byte encoding statistics models,
except that it applies them to Unicode code points.
For this to work, instead of the CharToOrderMap, which mapped directly
from encoded bytes (always 256 values) to orders, we now add an array
of frequent characters mapping generic Unicode code points to frequency
orders (which can be used with the same sequence mapping array).
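As a conceptual sketch only (the real tables are generated code and the
values below are made up), the difference looks like this:

```python
# Old approach: a flat 256-entry table indexed by the encoded byte.
char_to_order_map = [255] * 256   # 255 meaning "not a frequent character"
char_to_order_map[0xE9] = 3       # e.g. byte 0xE9 is the 4th most frequent

# Generic model: frequent characters listed by Unicode code point,
# each with its frequency order.  The orders index the very same
# sequence mapping array as before.
frequent_chars = [
    (0x61, 0),   # 'a' -> most frequent
    (0x65, 1),   # 'e'
    (0xE9, 3),   # 'é'
]

def unicode_to_order(code_point):
    # A binary search would be used in practice, the list being sorted
    # by code point; a linear scan keeps the sketch short.
    for cp, order in frequent_chars:
        if cp == code_point:
            return order
    return 255
```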
This of course means that each prober in which we want to use these
generic models will have to implement its own byte-to-code-point
decoder, as this is per-encoding logic anyway. This will come in a
subsequent commit.
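To illustrate what such a decoder amounts to (this is a sketch, not the
prober code; a real prober would carry its own conversion table), a
single-byte prober would first map each byte to a code point and then
look it up in the generic frequency table:

```python
# Toy code-point -> order table standing in for a generic language model.
frequent_order = {0x61: 0, 0x65: 1, 0xE9: 3}

def iso_8859_15_byte_to_code_point(byte_value):
    # Python's codecs already know the single-byte mapping; in the C++
    # prober this would be a small per-encoding lookup table.
    return ord(bytes([byte_value]).decode("iso-8859-15"))

for b in "été".encode("iso-8859-15"):
    order = frequent_order.get(iso_8859_15_byte_to_code_point(b), 255)
```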
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.
Right now, it returns the 2-character language code from ISO 639-1. A
project using the library could easily get the English language name
from the XML/JSON files provided by the iso-codes project. That project
also makes it easy to localize the language name into other languages
through gettext (this is what we do in GIMP for instance). I don't add
any dependency though, and leave it to downstream projects to implement
this.
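For instance, a downstream project could do something along these lines
(the file path and field names are those of the iso-codes package as
shipped on common Linux distributions, so treat them as an assumption):

```python
import json

def english_name(alpha_2_code,
                 path="/usr/share/iso-codes/json/iso_639-3.json"):
    # Look the ISO 639-1 code up in the iso-codes JSON data and return
    # the English language name.
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)["639-3"]
    for entry in entries:
        if entry.get("alpha_2") == alpha_2_code:
            return entry["name"]
    return None

print(english_name("da"))   # "Danish"
```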
I was also wondering whether we want to support region information in
cases where it would make sense. I especially wondered about it for
Chinese encodings, as some of them seem quite specific to a region
(according to Wikipedia at least). For the time being though, these
just return "zh". We'll see later if it makes sense to be more accurate
(maybe depending on reports?).
The newly added IBM865 charset (for Norwegian) can also be used for
Danish. By the way, I fixed `script/charsets/ibm865.py`, as Danish uses
the 'da' ISO 639-1 code, not 'dk' (which is used in other codes for
Denmark, such as the ISO 3166 country code and the internet TLD, but
not for the language itself).
For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Not sure if it is in the C++ standard, or ever was, but apparently some
compilers may complain when files don't end with a newline (though
neither GCC nor Clang do, as our CI and my local builds are fine). So
here are all our generated sources which didn't have such a trailing
newline (hopefully I forgot none). I just loaded them in my vim editor
and resaved them; this was enough to add the trailing newline.
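Just as an aside, a small script along these lines could check
generated files and append the missing newline automatically (this is
only an illustration, not something added to the repository):

```python
import sys

def ensure_trailing_newline(path):
    # Append a newline to the file if its last byte isn't one.
    with open(path, "rb+") as f:
        data = f.read()
        if data and not data.endswith(b"\n"):
            f.write(b"\n")
            return True
    return False

for path in sys.argv[1:]:
    if ensure_trailing_newline(path):
        print(f"added trailing newline to {path}")
```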