src: drop the SURE_YES confidence for character distribution probers.

Some probers are based on character distribution analysis. Though this
is still a relevant detection logic, we also know that it is a lot less
subtle than sequence distribution analysis.

Therefore let's give a good confidence to a text passing such analysis,
yet not a near-perfect one, thus leaving some chance for other probers.
In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, it is a very likely candidate.

I hit this case with the Finnish UTF-8 test, which passed (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was overridden
by UHC (EUC-KR). This used not to be a problem when nsMBCSGroupProber
checked the UTF-8 prober first and stopped there with just some basic
encoding detection. Now that we go further and return all relevant
candidates, having simpler detection algorithms which always return a
too-good confidence is not the best idea.
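To illustrate the effect described above, here is a minimal sketch of how the ceiling changes which candidate wins. It is not uchardet's actual code: `Candidate`, `CapConfidence`, `BestCandidate` and `WinnerWithCeiling` are hypothetical names, and the raw score of 1.0 for the UHC distribution prober is an assumption. Only the 0.86 confidence and the 0.99 and 0.7 ceilings come from this commit.

```cpp
#include <string>
#include <vector>

// Hypothetical candidate record; not uchardet's real data structure.
struct Candidate {
    std::string name;
    float confidence;
};

// Hypothetical helper: a character distribution prober that considers
// itself "sure" reports at most the ceiling (the SURE_YES value).
inline float CapConfidence(float raw, float sureYes) {
    return raw > sureYes ? sureYes : raw;
}

// Pick the candidate with the highest confidence (first wins on ties).
inline const Candidate* BestCandidate(const std::vector<Candidate>& all) {
    const Candidate* best = nullptr;
    for (const Candidate& c : all) {
        if (best == nullptr || c.confidence > best->confidence) {
            best = &c;
        }
    }
    return best;
}

// Replay the Finnish UTF-8 case with a given SURE_YES ceiling.
inline std::string WinnerWithCeiling(float sureYes) {
    std::vector<Candidate> all = {
        // Confidence taken from the commit message.
        {"UTF-8 (Finnish)", 0.86f},
        // Assumed raw score of 1.0 before the ceiling is applied.
        {"UHC (EUC-KR)", CapConfidence(1.0f, sureYes)},
    };
    return BestCandidate(all)->name;
}
```

Under these assumptions, `WinnerWithCeiling(0.99f)` selects UHC (EUC-KR), whereas `WinnerWithCeiling(0.7f)` selects UTF-8 (Finnish), since 0.86 now beats the capped 0.7.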
Jehan 2021-03-17 21:32:49 +01:00
parent b00c85a6a6
commit a7c5a167a9

@@ -43,7 +43,7 @@
 #include "EUCTWFreq.tab"
 #include "GB2312Freq.tab"
-#define SURE_YES 0.99f
+#define SURE_YES 0.7f
 #define SURE_NO 0.01f
 //return confidence base on received data