7 Commits

Author SHA1 Message Date
Jehan
bed459c6e7 src: drop less of UTF-8 confidence even with few non-multibyte chars.
Some languages are not meant to have multibyte characters. For instance,
English would typically have none. Yet you can still have UTF-8 English
text (with a few special characters, or foreign words…). So anyway let's
make it less of a deal breaker.

To be fair, the whole logic is biased of course, and I believe that
eventually we should get rid of these lines of code dropping confidence
based on a character count. This is a ridiculous rule (we base our whole
logic on language statistics and suddenly we add some weird rule with a
completely arbitrary number). But for now, I'll keep this as-is until we
make the whole library even more robust.
2022-12-14 00:24:53 +01:00
Jehan
b00c85a6a6 src: do not shortcut UTF-8 detection too early.
I hit this case with the Czech test, which was detected as Irish after
detection was shortcut far too early, after only 16 characters. The
confidence value was just barely above 0.5 for Irish (and barely below
for Czech).

By adding a threshold (at least 256 characters), we give the engine
enough relevant data to make an informed decision. By then, the Czech
confidence was above 0.7, whereas the Irish one was at 0.6.
2022-12-14 00:23:13 +01:00
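
The threshold described in this commit can be sketched as a simple guard. This is a minimal illustration, not the actual uchardet code: the function and constant names are hypothetical, the 256 figure comes from the commit message, and the 0.5 cutoff is only assumed from the "barely above 0.5" remark.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants. 256 is the minimum character count named in the
// commit message; 0.5 is an assumed confidence cutoff for shortcutting.
constexpr std::size_t kMinCharsForShortcut = 256;
constexpr float kShortcutConfidence = 0.5f;

// Returns true only when enough characters have been seen AND the
// confidence is above the cutoff. With too little data the engine keeps
// reading, however confident it currently is.
bool MayShortcut(std::size_t charsSeen, float confidence)
{
  assert(confidence >= 0.0f && confidence <= 1.0f);
  return charsSeen >= kMinCharsForShortcut &&
         confidence > kShortcutConfidence;
}
```

With these assumed values, a candidate scoring just above 0.5 after 16 characters no longer ends detection, while a score of 0.7 after 256 characters does.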
Jehan
2127f4fc0d src: allow for nsCharSetProber to return several candidates.
No functional change yet because all probers still return 1 candidate.
Yet now we add a GetCandidates() method to return a number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
(charset, language) pair and that the confidence is for this specific
pair, not just the confidence for charset detection.
2022-12-14 00:23:13 +01:00
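
A rough sketch of the indexed-candidate interface described in this commit. Only the four method names come from the commit message. The class `SketchProber`, the `Candidate` struct, the `BestCandidate` helper and the exact signatures are assumptions for illustration, not the real nsCharSetProber declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (charset, language) pair with the confidence for that pair.
struct Candidate {
  std::string charset;
  std::string language;
  float confidence;
};

// Hypothetical stand-in for a prober exposing indexed candidates.
class SketchProber {
public:
  explicit SketchProber(std::vector<Candidate> candidates)
    : mCandidates(std::move(candidates)) {}

  // Number of candidates; valid indexes are 0 .. GetCandidates() - 1.
  int GetCandidates() const { return static_cast<int>(mCandidates.size()); }

  const char* GetCharSetName(int candidate) const { return At(candidate).charset.c_str(); }
  const char* GetLanguage(int candidate) const { return At(candidate).language.c_str(); }
  float GetConfidence(int candidate) const { return At(candidate).confidence; }

private:
  // The index must be below the return value of GetCandidates().
  const Candidate& At(int candidate) const {
    assert(candidate >= 0 && candidate < GetCandidates());
    return mCandidates[static_cast<std::size_t>(candidate)];
  }

  std::vector<Candidate> mCandidates;
};

// Returns the index of the most confident candidate, or -1 if there is none.
int BestCandidate(const SketchProber& prober)
{
  int best = -1;
  for (int i = 0; i < prober.GetCandidates(); i++) {
    if (best < 0 || prober.GetConfidence(i) > prober.GetConfidence(best))
      best = i;
  }
  return best;
}
```

A caller would loop from 0 to GetCandidates() - 1, as BestCandidate() does, rather than assume a single result.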
Jehan
5257fc1abf Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober not only detects valid UTF-8, but also detects the
most probable language. Using the data generated 2 commits ago, this
works very well.

This is still basic and will require even more improvements. In
particular, nsUTF8Prober should now return an array of ("UTF-8",
language) candidate pairs. And nsMBCSGroupProber should itself forward
these candidates as well as other candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid and low enough to give
chances to other good candidate pairs).
2022-12-14 00:23:13 +01:00
Jehan
53f7ad0e0b Bug 101032 - assignments to nsSMState in nsCodingStateMachine result...
... in unspecified behavior.
When compiling with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states are specific to each charset's state machine,
it is not possible to simply extend the enum with more generic values.
Instead let's just make the state an unsigned int value and define the
3 generic states as constants.
2017-05-28 20:01:06 +02:00
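
The fix described in this commit can be sketched as follows. This is a minimal illustration under assumptions: the type and constant names (`SketchSMState`, `kStart`, `kError`, `kItsMe`), their values, and the tiny transition table are all hypothetical and not taken from nsCodingStateMachine.

```cpp
#include <cassert>

// The state is a plain unsigned integer instead of an enum, so that
// charset-specific states (3, 4, 5, ...) are valid values of the type.
using SketchSMState = unsigned int;

// The 3 generic states shared by every state machine (assumed values).
constexpr SketchSMState kStart = 0;
constexpr SketchSMState kError = 1;
constexpr SketchSMState kItsMe = 2;

// Hypothetical machine with 2 byte classes and 4 states; state 3 is
// specific to this machine and has no generic name.
constexpr unsigned int kClassCount = 2;
constexpr unsigned int kStateCount = 4;
constexpr SketchSMState kTable[kStateCount * kClassCount] = {
  /* kStart  */ kStart, 3,
  /* kError  */ kError, kError,
  /* kItsMe  */ kItsMe, kItsMe,
  /* state 3 */ kError, kItsMe,
};

// Looks up the next state. Storing the value 3 in a SketchSMState is well
// defined, unlike loading an out-of-range value through an enum type.
SketchSMState NextState(SketchSMState current, unsigned int byteClass)
{
  assert(current < kStateCount && byteClass < kClassCount);
  return kTable[current * kClassCount + byteClass];
}
```

With an enum declaring only the three generic states, holding a table value such as 5 in a variable of the enum type is what UBSan reports. An integer type sidesteps the problem without listing every charset's states in one enum.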
BYVoid
84284eccf4 Update code from upstream.
2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release.
2011-07-10 15:04:42 +08:00