From d26bc965ad820aa999fbb3ccaa2444b560f98d6d Mon Sep 17 00:00:00 2001
From: Jehan
Date: Wed, 17 Mar 2021 21:32:49 +0100
Subject: [PATCH] src: drop the SURE_YES confidence for character distribution probers.

Some probers are based on character distribution analysis. Though this
is still relevant detection logic, we also know that it is a lot less
subtle than sequence distribution. Therefore let's give a good
confidence to a text passing such analysis, yet not a near-perfect one,
thus leaving some chance for other probers.

In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, it is a very likely candidate.
I hit this case with the Finnish UTF-8 test, which passed (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was
overridden by UHC (EUC-KR).

This used not to be a problem when nsMBCSGroupProber would check the
UTF-8 prober first and stop there with just some basic encoding
detection. Now that we go further and return all relevant candidates,
a simpler detection algorithm which always returns a too-good
confidence is not the best idea.
---
 src/CharDistribution.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/CharDistribution.cpp b/src/CharDistribution.cpp
index 488d9bc..dbe6fc3 100644
--- a/src/CharDistribution.cpp
+++ b/src/CharDistribution.cpp
@@ -43,7 +43,7 @@
 #include "EUCTWFreq.tab"
 #include "GB2312Freq.tab"
 
-#define SURE_YES 0.99f
+#define SURE_YES 0.7f
 #define SURE_NO 0.01f
 
 //return confidence base on received data
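
The effect described in the commit message can be sketched as follows. This is only an illustration, not the uchardet code: it assumes SURE_YES acts as an upper bound on the confidence a character distribution prober reports, and that the candidate with the highest confidence wins. The names Candidate, CapDistributionConfidence and BestCandidate are hypothetical, and the raw 0.99 score used for UHC in the usage note is an illustrative value, not a measured one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical candidate record; the real uchardet types differ.
struct Candidate {
    std::string charset;
    float confidence;  // expected to lie in [0, 1]
};

// Assumption: a distribution-only prober never reports more than sure_yes,
// mirroring the role of the SURE_YES macro changed by this patch.
float CapDistributionConfidence(float raw, float sure_yes) {
    return raw > sure_yes ? sure_yes : raw;
}

// Return the candidate with the highest confidence (first one wins on ties),
// or nullptr when the list is empty.
const Candidate* BestCandidate(const std::vector<Candidate>& candidates) {
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates) {
        assert(c.confidence >= 0.0f && c.confidence <= 1.0f);
        if (best == nullptr || c.confidence > best->confidence) {
            best = &c;
        }
    }
    return best;
}
```

With a cap of 0.99, a distribution-based UHC candidate beats the 0.86 sequence-based (UTF-8, Finnish) candidate; with a cap of 0.7, the UTF-8 candidate wins instead.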