uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-08 18:06:40 +08:00

Author	SHA1	Message	Date
Jehan	f8752f2b56	src, script: add concept of alphabet_mapping in language models. This makes it possible to handle cases where some characters are alternatives or variants of another, for instance when the same word can be written with either variant and both spellings are considered correct and equivalent. I use this for the first time on characters with diacritics in Slovene; browsing Slovenian Wikipedia a bit, it looks like they are only used for titles there. Indeed, these are so rarely used that they would hardly show up in the stats and, worse, any sequence using them in tested text would likely count as a negative sequence and hence drop the confidence in Slovenian. As a consequence, various Slovene texts would show up as Slovak, since it is close enough and commonly uses the same characters with diacritics.	2021-03-21 15:54:24 +01:00
Jehan	5fe9a7e1df	script: regenerate Slovak and Slovene with better alphabet support. I was missing some characters, especially in the Slovak alphabet. Conversely, the Slovene alphabet does not use 4 letters of the common ASCII alphabet.	2021-03-21 13:30:41 +01:00
Jehan	5c3a2e8037	src, script: regenerate all existing language models. Now making sure that we have a generic language model working with UTF-8 for all 26 supported models which had single-byte encoding support until now.	2021-03-17 02:07:17 +01:00
Jehan	911695f682	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A downstream project could easily get the English language name from the XML/JSON files provided by the iso-codes project. The iso-codes project also makes it easy to localize the language name into other languages through gettext (this is what we do in GIMP, for instance). I don't add any dependency though, and leave it to downstream projects to implement this. I was also wondering whether we want to support region information in cases where it would make sense. I especially wondered about it for Chinese encodings, as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2021-03-14 00:12:30 +01:00
Jehan	d62154bd6e	LangModels: add Slovene support. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and MAC-CENTRALEUROPE. Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet	2016-09-28 22:13:17 +02:00
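
The alphabet_mapping idea from commit f8752f2b56 can be illustrated with a small sketch. This is not the code from the repository's scripts: the mapping contents, the function names fold_variants and bigram_counts, and the use of bigrams are all assumptions made for illustration. The point is only that rare variant characters are folded into their base letter before sequences are counted, so text using them produces the same statistics as text without them.

```python
from collections import Counter

# Hypothetical mapping (not taken from the repository): each rare accented
# variant is treated as equivalent to its unaccented base letter.
ALPHABET_MAPPING = {
    "á": "a",
    "é": "e",
    "í": "i",
    "ó": "o",
    "ú": "u",
}


def fold_variants(text, mapping=ALPHABET_MAPPING):
    """Lowercase the text and replace each variant with its base letter."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def bigram_counts(text, mapping=ALPHABET_MAPPING):
    """Count letter bigrams after folding variants into base letters."""
    folded = fold_variants(text, mapping)
    pairs = zip(folded, folded[1:])
    # Only pairs made of two letters are counted; spaces and punctuation
    # break the sequence.
    return Counter(a + b for a, b in pairs if a.isalpha() and b.isalpha())


if __name__ == "__main__":
    # Both spellings yield identical bigram statistics once folded.
    print(bigram_counts("máti") == bigram_counts("mati"))
```

With such a mapping, a sequence containing a rarely used accented letter is counted as the corresponding common sequence instead of being scored as an unseen one.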

5 Commits
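
Commit 911695f682 leaves it to downstream projects to turn the returned ISO 639-1 code into a language name. A minimal sketch of that lookup follows. It assumes the JSON layout used by the iso-codes project's ISO 639-2 file (a top-level "639-2" list whose entries carry "alpha_2" and "name"); the inline sample data and the helper name language_name are hypothetical, and a real project would load the installed file instead. The None case mirrors probers that do not report a language.

```python
import json

# Inline sample shaped like the iso-codes JSON data (assumed layout); a real
# project would read the file shipped by iso-codes rather than embed it.
SAMPLE_ISO_639 = json.loads("""
{"639-2": [
  {"alpha_2": "sl", "alpha_3": "slv", "name": "Slovenian"},
  {"alpha_2": "sk", "alpha_3": "slk", "name": "Slovak"},
  {"alpha_2": "zh", "alpha_3": "zho", "name": "Chinese"}
]}
""")


def language_name(code, data=SAMPLE_ISO_639):
    """Return the English name for a 2-letter code, or None if unknown.

    `code` may be None when the detector could not report a language.
    """
    if code is None:
        return None
    for entry in data["639-2"]:
        # Not every ISO 639-2 entry has a 2-letter code, hence .get().
        if entry.get("alpha_2") == code:
            return entry["name"]
    return None


if __name__ == "__main__":
    for detected in ("sl", "zh", None):
        print(detected, "->", language_name(detected))
```

Localized names could then be obtained by passing the English name through gettext with the iso-codes translation domain, which is left out here to keep the sketch dependency-free.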