uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-07 17:26:41 +08:00

Author	SHA1	Message	Date
Jehan	cec8817d79	src: new Big5 detection implementation. Rather than using a huge frequency table through some state machine code that I don't even understand, I noticed that the Big5 encoding is organized from the start into tables of frequent and non-frequent characters (per the Wikipedia page on Big5). This makes it very easy to count characters by just checking which class each character is in. In a few tests with random Chinese text converted to Big5, it seems to work pretty well (and fixes the test which got broken by the previous commit), and it doesn't slow down detection in any significant way either. This may be the next step towards also improving the various multi-byte encoding detectors, which still use generated coding state machines that mostly still elude me.	2022-12-19 00:01:12 +01:00
Jehan	2127f4fc0d	src: allow for nsCharSetProber to return several candidates. No functional change yet because all probers still return 1 candidate. However, we now add a GetCandidates() method to return the number of candidates. GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter which is the candidate index (which must be below the return value of GetCandidates()). We can now consider that nsCharSetProber computes a (charset, language) couple and that the confidence is for this specific couple, not just the confidence for charset detection.	2022-12-14 00:23:13 +01:00
Jehan	5257fc1abf	Using the generic language detector in UTF-8 detection. The UTF-8 prober now not only detects valid UTF-8, but also detects the most probable language. Using the data generated 2 commits away, this works very well. This is still basic and will require even more improvements. In particular, nsUTF8Prober should now return an array of ("UTF-8", language) candidate couples, and nsMBCSGroupProber should itself forward these candidates as well as candidates from the other multi-byte detectors. This way, the public-facing API would get more probable candidates, in case the algorithm is slightly wrong. Also, the UTF-8 confidence is currently stupidly high as soon as we consider it to be right. We should likely weigh it with language detection (in particular, if no language is detected, this should severely weigh down UTF-8 detection; not to 0, but high enough to be a fallback in case no other encoding+lang is valid and low enough to give chances to other good candidate couples).	2022-12-14 00:23:13 +01:00
Jehan	53f7ad0e0b	Bug 101032 - assignments to nsSMState in nsCodingStateMachine result in unspecified behavior. When compiling with UBSan (-fsanitize=undefined), execution complains: "runtime error: load of value 5, which is not a valid value for type 'nsSMState'". Since the machine states are different for each charset's state machine, it is not possible to simply extend the enum with more generic values. Instead, let's just make the state an unsigned int value and define the 3 generic states as constants.	2017-05-28 20:01:06 +02:00
BYVoid	84284eccf4	Update code from upstream.	2011-07-11 14:42:50 +08:00
BYVoid	3601900164	Initial release.	2011-07-10 15:04:42 +08:00
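
The class-counting idea described in commit cec8817d79 can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: the names `Big5Counts`, `isBig5Trail` and `countBig5Classes` are hypothetical, and the byte ranges (frequent characters at 0xA440 to 0xC67E, less frequent ones at 0xC940 to 0xF9D5) are taken from the usual description of the Big5 layout rather than from the uchardet sources.

```cpp
#include <cstddef>

// Hypothetical tally of how many 2-byte sequences fall in each Big5 block.
struct Big5Counts {
    std::size_t frequent = 0;  // 0xA440..0xC67E: frequently used characters
    std::size_t rare = 0;      // 0xC940..0xF9D5: less frequently used characters
    std::size_t other = 0;     // valid 2-byte sequences outside both blocks
    std::size_t invalid = 0;   // bytes that cannot start or complete a sequence
};

// Assumed Big5 trail byte ranges: 0x40..0x7E and 0xA1..0xFE.
static bool isBig5Trail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

// Walk the buffer once and classify every 2-byte sequence by code range.
Big5Counts countBig5Classes(const unsigned char* buf, std::size_t len) {
    Big5Counts counts;
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = buf[i];
        if (lead < 0x80) {  // plain ASCII carries no Big5 signal
            ++i;
            continue;
        }
        if (lead < 0x81 || lead > 0xFE || i + 1 >= len || !isBig5Trail(buf[i + 1])) {
            ++counts.invalid;
            ++i;
            continue;
        }
        const unsigned code = (static_cast<unsigned>(lead) << 8) | buf[i + 1];
        if (code >= 0xA440 && code <= 0xC67E) {
            ++counts.frequent;
        } else if (code >= 0xC940 && code <= 0xF9D5) {
            ++counts.rare;
        } else {
            ++counts.other;
        }
        i += 2;
    }
    return counts;
}
```

A prober built on such counts could, for instance, raise its confidence when `frequent` dominates `rare` and drop it sharply when `invalid` is non-zero. How the actual commit turns the counts into a confidence value is not stated in the log, so that part is left out here.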

6 Commits
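
The fix described in commit 53f7ad0e0b (replacing an enum state type with an unsigned integer plus three generic constants) can be illustrated with a small sketch. The names `SMState`, `kStart`, `kError`, `kItsMe`, `kToyNeedTrail` and `toyNextState` are hypothetical and the two-byte toy machine is invented for illustration; only the overall approach comes from the commit message.

```cpp
#include <cstdint>

// The state is a plain unsigned integer, so every charset-specific state
// machine may use values beyond the three generic ones without loading an
// out-of-range value into an enum type.
using SMState = std::uint32_t;

// The three generic states shared by all machines, defined as constants.
constexpr SMState kStart = 0;
constexpr SMState kError = 1;
constexpr SMState kItsMe = 2;

// A state private to this toy machine. With a three-value enum, loading
// such a value is the kind of thing UBSan reported.
constexpr SMState kToyNeedTrail = 5;

// Toy two-byte machine: a high byte must be followed by another high byte.
SMState toyNextState(SMState state, unsigned char byte) {
    const bool high = byte >= 0x80;
    if (state == kStart) {
        return high ? kToyNeedTrail : kStart;
    }
    if (state == kToyNeedTrail) {
        return high ? kStart : kError;
    }
    return state;  // kError and kItsMe are absorbing
}
```

For example, feeding the bytes 0xA4 then 0x41 from `kStart` moves the machine to `kToyNeedTrail` and then to `kError`, with no enum involved at any point.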