uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-13 15:10:06 +08:00

Author	SHA1	Message	Date
Jehan	7f99b91388	src: new weight concept in the C API. Pretty basic: you can weight preferred languages and this will impact the result. Say the algorithm "hesitates" between encoding E1 in language L1 and encoding E2 in language L2. By setting L2 with a 1.1 weight, for instance because this is the OS language or the usual preferred language, you may help the algorithm to overcome very tight cases. It can also be helpful when you already know for sure the language of a document but not its encoding. Then you may set a very high value for this language, or simply set a default value of 0, and set 1 for this language. Only relevant encodings will be taken into account. This is still limited though, as generic encodings are still implemented language-agnostic. UTF-8 for instance would be disadvantaged by this weight system until we make it language-aware.	2021-03-14 00:12:30 +01:00
Jehan	911695f682	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A using project could easily get the English language name from the XML/json files provided by the iso-codes project. That project also makes it easy to localize the language name in other languages through gettext (this is what we do in GIMP for instance). I don't add any dependency though and leave it to downstream projects to implement this. I was also wondering if we want to support region information for cases when it would make sense. I especially wondered about it for Chinese encodings as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2021-03-14 00:12:30 +01:00
Jehan	4da22cca97	src: new API to get all candidates and their confidence. Adding: - uchardet_get_candidates() - uchardet_get_encoding() - uchardet_get_confidence() Also deprecating uchardet_get_charset() to have developers look at the new API instead. I was unsure if this should really get deprecated as it makes the basic case simple, but the new API is just as easy anyway. You can also directly call uchardet_get_encoding() with candidate 0 (same as uchardet_get_charset(), it would then return "" when no candidate was found).	2021-03-14 00:12:30 +01:00
Ilya Tumaykin	6db8b6f8fe	cmake: minor comment cleanups	2016-03-22 01:23:06 +03:00
Ilya Tumaykin	1a1f4bfbd8	cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity	2016-03-22 01:23:03 +03:00
Ilya Tumaykin	b44be77be6	cmake: uniform indent everywhere Indent with tabs, remove leading/trailing blank lines and spaces.	2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)	78b55ec9fe	CMake: Fix regression in f53cb8c building in paths with spaces Tested with Ninja and Make in Windows and Archlinux with paths with and without spaces.	2016-03-18 03:37:12 +00:00
Ricardo Constantino (:RiCON)	f53cb8cddd	CMake: fix linking with Ninja	2016-03-16 14:17:47 +00:00
nu774	ba6679f2b3	fix: export symbols were not passed to the linker as intended	2015-06-20 12:28:01 +09:00
BYVoid	3601900164	Initial release.	2011-07-10 15:04:42 +08:00

10 Commits
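
The three 2021 commits above describe candidates that each carry an encoding, a confidence and (for some probers) an ISO 639-1 language code, plus per-language weights that can tip a close decision. The following is a minimal, standalone sketch of that idea, not uchardet's actual implementation or API: the types `candidate_t` and `language_weight_t` and the functions `weight_for()` and `best_candidate()` are hypothetical names made up for illustration. It assumes the weight is applied by multiplying the raw confidence, and that a candidate without a language (NULL, as for generic probers such as UTF-8) falls back to the default weight.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one detection candidate (not a uchardet type). */
typedef struct {
    const char *encoding;   /* e.g. "ISO-8859-1" */
    const char *language;   /* ISO 639-1 code, or NULL for language-agnostic probers */
    float       confidence; /* raw confidence reported by the prober */
} candidate_t;

/* Hypothetical per-language weight entry. */
typedef struct {
    const char *language;   /* ISO 639-1 code */
    float       weight;
} language_weight_t;

/* Return the weight configured for `language`, or `default_weight` when the
 * language is NULL or has no explicit entry (assumption of this sketch). */
float weight_for(const char *language,
                 const language_weight_t *weights, size_t n_weights,
                 float default_weight)
{
    if (language == NULL)
        return default_weight;
    for (size_t i = 0; i < n_weights; i++) {
        if (strcmp(weights[i].language, language) == 0)
            return weights[i].weight;
    }
    return default_weight;
}

/* Return the index of the candidate with the highest weighted confidence,
 * or -1 when there is no candidate with a score above zero. */
int best_candidate(const candidate_t *candidates, size_t n_candidates,
                   const language_weight_t *weights, size_t n_weights,
                   float default_weight)
{
    assert(candidates != NULL || n_candidates == 0);

    int   best       = -1;
    float best_score = 0.0f;
    for (size_t i = 0; i < n_candidates; i++) {
        float score = candidates[i].confidence *
                      weight_for(candidates[i].language,
                                 weights, n_weights, default_weight);
        if (score > best_score) {
            best_score = score;
            best       = (int) i;
        }
    }
    return best;
}
```

With candidates ("ISO-8859-1", "fr", 0.80) and ("ISO-8859-2", "hu", 0.78), a weight of 1.1 on "hu" and a default of 1.0 moves the choice from the first candidate to the second. With a default of 0 and a weight of 1 on a single language, only candidates in that language can win, which also shows why a language-agnostic UTF-8 candidate is disadvantaged in this model.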