uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-10 02:46:40 +08:00

Author	SHA1	Message	Date
Jehan	908f9b8ba7	src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. This was badly named as this function does not return candidates, but the number of candidates (to be actually used in other API).	2022-12-14 00:24:53 +01:00
Jehan	669ede73a3	src: new weight concept in the C API. Pretty basic, you can weight preferred languages and this will impact the result. Say the algorithm "hesitates" between encoding E1 in language L1 and encoding E2 in language L2. By setting L2 with a 1.1 weight, for instance because this is the OS language, or usual preferred language, you may help the algorithm to overcome very tight cases. It can also be helpful when you already know for sure the language of a document, you just don't know its encoding. Then you may set a very high value for this language, or simply set a default value of 0, and set 1 for this language. Only relevant encodings will be taken into account. This is still limited though as generic encodings are still implemented language-agnostic. UTF-8 for instance would be disadvantaged by this weight system until we make it language-aware.	2022-12-14 00:23:13 +01:00
Jehan	5a949265d5	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A using project could easily get the English language name from the XML/json files provided by the iso-codes project. This project will also allow to easily localize the language name in other languages through gettext (this is what we do in GIMP for instance). I don't add any dependency though and leave it to downstream projects to implement this. I was also wondering if we want to support region information for cases when it would make sense. I especially wondered about it for Chinese encodings as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2022-12-14 00:23:13 +01:00
Jehan	8118133e00	src: new API to get all candidates and their confidence. Adding: - uchardet_get_candidates() - uchardet_get_encoding() - uchardet_get_confidence() Also deprecating uchardet_get_charset() to have developers look at the new API instead. I was unsure if this should really get deprecated as it makes the basic case simple, but the new API is just as easy anyway. You can also directly call uchardet_get_encoding() with candidate 0 (same as uchardet_get_charset(), it would then return "" when no candidate was found).	2022-12-14 00:23:13 +01:00
Jehan	15fc8f0a0f	src: now reporting encoding+confidence and keeping a list. Preparing for an updated API which will also allow to look at the confidence value, as well as get the list of possible candidates (i.e. all detected encodings which had a confidence value high enough so that we would even consider them). It is still only internal logic though.	2022-12-14 00:23:13 +01:00
Jehan	6c7f32a751	Issue #10 : Crashing sequence with nsSJISProber. uchardet_handle_data() should not try to process data of zero length. Still this is not technically an error to feed empty data to the engine, and I could imagine it could happen especially when done in some automatic process with random input files (which looks like what was happening in the reporter's case). So feeding empty data just returns a success without actually doing any processing, allowing to continue the data feed.	2020-04-22 22:11:51 +02:00
Jehan	256d1957b2	uchardet_get_charset() should never return NULL... ... to stay backward compatible with previous behavior. About detection failure, our in-code documentation says: "@return name of charset on success and "" on failure or pure ascii." This behavior had been broken by commit 3a518c0, which returned NULL instead. Our command-line tool was the first victim, segfaulting on ASCII files.	2015-11-16 17:33:16 +01:00
Carbo Kuo	016eb18437	Merge pull request #15 from wang-bin/c++abi do not use std::string which breaks c++ abi	2015-11-09 20:04:21 +01:00
wang-bin	3a518c0536	do not use std::string which breaks c++ abi Some stl types can break abi. If the program is built with g++ 5, and libstdc++ on the target platform is g++ 4.x, then it can not run	2015-11-04 18:16:24 +08:00
Jehan	ba97505efc	(void) and () empty arguments are different in C. This fixes the following warning when including uchardet.h in C source, built with -Wstrict-prototypes: `uchardet.h:52:1: warning: function declaration isn't a prototype`	2015-09-05 15:58:56 +02:00
BYVoid	84284eccf4	Update code from upstream.	2011-07-11 14:42:50 +08:00
BYVoid	1b05009d4d	Update contributors information.	2011-07-10 15:43:28 +08:00
BYVoid	1094508286	Dos2unix.	2011-07-10 15:20:41 +08:00
BYVoid	9be8afdfb9	Complete comments on interface.	2011-07-10 15:20:05 +08:00
BYVoid	3601900164	Initial release.	2011-07-10 15:04:42 +08:00

15 Commits
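
The weight concept described in commit 669ede73a3 can be illustrated with a small standalone sketch. This is not uchardet's code and does not call its API: the types candidate_t and lang_weight_t and the functions weight_for() and pick_best() are hypothetical names made up for this illustration. It assumes that each candidate's confidence is simply multiplied by the weight of its language, and that language-agnostic candidates (language NULL, as the log says happens for UTF-8) fall back to the default weight.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate record, NOT uchardet's internal type. */
typedef struct {
    const char *encoding;
    const char *language;   /* ISO 639-1 code, or NULL if language-agnostic */
    float       confidence; /* raw detector confidence in [0, 1] */
} candidate_t;

/* Hypothetical per-language weight entry. */
typedef struct {
    const char *language;
    float       weight;
} lang_weight_t;

/* Return the weight for a language, or default_weight when the language
 * is NULL or has no explicit entry. */
static float weight_for(const char *language,
                        const lang_weight_t *weights, size_t n_weights,
                        float default_weight)
{
    if (language != NULL) {
        for (size_t i = 0; i < n_weights; i++) {
            if (strcmp(weights[i].language, language) == 0)
                return weights[i].weight;
        }
    }
    return default_weight;
}

/* Return the index of the candidate with the highest weighted score
 * (confidence * weight), or -1 when every score is zero or below.
 * Assumption: plain multiplication; the real library may combine
 * confidence and weight differently. */
static int pick_best(const candidate_t *cands, size_t n_cands,
                     const lang_weight_t *weights, size_t n_weights,
                     float default_weight)
{
    int   best = -1;
    float best_score = 0.0f;

    for (size_t i = 0; i < n_cands; i++) {
        float score = cands[i].confidence *
                      weight_for(cands[i].language, weights, n_weights,
                                 default_weight);
        if (score > best_score) {
            best_score = score;
            best = (int) i;
        }
    }
    return best;
}
```

Under these assumptions, a tight case such as confidence 0.60 for language L1 against 0.58 for language L2 flips to L2 once L2 carries a 1.1 weight. Setting the default weight to 0 and a single language to 1 leaves only that language's encodings in play, which also shows why a language-agnostic UTF-8 candidate is disadvantaged by such a scheme.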