From d26bc965ad820aa999fbb3ccaa2444b560f98d6d Mon Sep 17 00:00:00 2001
From: Jehan <jehan@girinstud.io>
Date: Wed, 17 Mar 2021 21:32:49 +0100
Subject: [PATCH] src: drop the SURE_YES confidence for character distribution
 probers.

Some probers are based on character distribution analysis. Though this
is still relevant detection logic, we also know that it is a lot less
subtle than sequence distribution analysis.

Therefore let's give a good confidence for a text passing such analysis,
yet not a near perfect one, thus leaving some chance for other probers.
In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, this is a very likely candidate.

I had the case with the Finnish UTF-8 test, which was passing (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was overridden
by UHC (EUC-KR). This used not to be a problem when nsMBCSGroupProber
would check the UTF-8 prober first and stop there with just some basic
encoding detection. Now that we go further and return all relevant
candidates, having simpler detection algorithms always return a
too-good confidence is not the best idea.
---
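Note (not part of the commit message): the effect described above can be
illustrated with a minimal standalone sketch. It is not uchardet code.
Candidate, PickBest and WinnerWithCap are hypothetical names, and the
sketch assumes that the final choice is simply the candidate with the
highest confidence and that the UHC prober was returning SURE_YES for
the Finnish test.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one detection result; not a uchardet type.
struct Candidate {
    std::string name;
    float confidence;
};

// Returns the candidate with the highest confidence (first one wins on ties).
// Precondition: candidates is non-empty.
const Candidate& PickBest(const std::vector<Candidate>& candidates) {
    const Candidate* best = &candidates.front();
    for (const Candidate& c : candidates) {
        if (c.confidence > best->confidence) {
            best = &c;
        }
    }
    return *best;
}

// Simulates the Finnish UTF-8 case: the sequence-based result scores 0.86,
// while the character-distribution result is assumed to return its cap
// (the SURE_YES value) unchanged.
std::string WinnerWithCap(float distributionCap) {
    std::vector<Candidate> candidates = {
        {"UTF-8 (Finnish)", 0.86f},
        {"UHC (EUC-KR)", distributionCap},
    };
    return PickBest(candidates).name;
}
```

Under these assumptions, WinnerWithCap(0.99f) selects UHC (EUC-KR), while
WinnerWithCap(0.7f) selects UTF-8 (Finnish), which is the outcome this
patch aims for.
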
 src/CharDistribution.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/CharDistribution.cpp b/src/CharDistribution.cpp
index 488d9bc..dbe6fc3 100644
--- a/src/CharDistribution.cpp
+++ b/src/CharDistribution.cpp
@@ -43,7 +43,7 @@
 #include "EUCTWFreq.tab"
 #include "GB2312Freq.tab"
 
-#define SURE_YES 0.99f
+#define SURE_YES 0.7f
 #define SURE_NO  0.01f
 
 //return confidence base on received data