4 Commits

Author SHA1 Message Date
LSY
d72a5c88ce add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
Jehan
a7c5a167a9 src: drop the SURE_YES confidence for character distribution probers.
Some probers are based on character distribution analysis. Though it is
still relevant detection logic, we also know that it is a lot less
subtle than sequence distribution analysis.

Therefore let's give a good confidence for a text passing such analysis,
yet not a near perfect one, thus leaving some chance for other probers.
In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, this is a very likely candidate.

I had the case with the Finnish UTF-8 test, which was passing (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was overridden
by UHC (EUC-KR). This used to not be a problem when nsMBCSGroupProber
would check the UTF-8 prober first and stop there with just some basic
encoding detection. Now that we go further and return all relevant
candidates, a simpler detection algorithm which always returns a
too-good confidence is not the best idea.
2022-12-14 00:23:13 +01:00
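
The reasoning in the commit message above can be illustrated with a minimal sketch. This is not the actual uchardet code: `Candidate`, `EffectiveConfidence`, `BestCharset` and the cap value `kDistributionCap` are hypothetical names chosen for illustration, and the 0.7 cap is an assumption rather than the value used by the commit. The idea is only that a prober relying on character distribution alone gets a good but bounded confidence, so a candidate with a strong sequence distribution score can still win.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical cap for probers that rely on character distribution only.
// The value is an assumption for illustration, not taken from the commit.
constexpr float kDistributionCap = 0.7f;

// Hypothetical description of one detection candidate.
struct Candidate {
    std::string charset;
    float confidence;       // raw confidence reported by the prober, in [0, 1]
    bool distributionOnly;  // true if based on character distribution analysis alone
};

// Confidence used for ranking: distribution-only results are capped,
// all other results are used as reported.
inline float EffectiveConfidence(const Candidate& c) {
    return c.distributionOnly ? std::min(c.confidence, kDistributionCap)
                              : c.confidence;
}

// Returns the charset with the highest effective confidence,
// or an empty string when there are no candidates.
inline std::string BestCharset(const std::vector<Candidate>& candidates) {
    std::string best;
    float bestConf = -1.0f;
    for (const Candidate& c : candidates) {
        const float conf = EffectiveConfidence(c);
        if (conf > bestConf) {
            bestConf = conf;
            best = c.charset;
        }
    }
    return best;
}
```

With a near-perfect raw score for a distribution-only candidate (say 0.99 for UHC, an assumed stand-in) next to a 0.86 sequence-based score for UTF-8, `BestCharset` picks UTF-8 under this sketch, whereas ranking on the raw values would pick UHC.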
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00