From bed459c6e75e8a5be59ccd9bc80ac76c0bb8dbeb Mon Sep 17 00:00:00 2001
From: Jehan <jehan@girinstud.io>
Date: Sun, 23 May 2021 17:04:37 +0200
Subject: [PATCH] src: drop less UTF-8 confidence even with few multibyte
 chars.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Some languages are not meant to have multibyte characters. For instance,
English would typically have none. Yet you can still have UTF-8 English
text (with a few special characters, or foreign words…). So let's make a
low multibyte character count less of a deal breaker.

To be even fairer, the whole logic is biased of course, and I believe
that eventually we should get rid of these lines of code dropping
confidence based on a character count. This is a ridiculous rule (we
base our whole logic on language statistics and suddenly we add some
weird rule with a completely arbitrary number). But for now, I'll keep
this as-is until we make the whole library even more robust.
---
 src/nsUTF8Prober.cpp | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/nsUTF8Prober.cpp b/src/nsUTF8Prober.cpp
index 21f885e..6618bec 100644
--- a/src/nsUTF8Prober.cpp
+++ b/src/nsUTF8Prober.cpp
@@ -99,12 +99,13 @@ nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
 
 float nsUTF8Prober::GetConfidence(int candidate)
 {
-  float unlike = (float)0.99;
-
   if (mNumOfMBChar < 6)
   {
+    float unlike = 0.5f;
+
     for (PRUint32 i = 0; i < mNumOfMBChar; i++)
       unlike *= ONE_CHAR_PROB;
+
     return (float)1.0 - unlike;
   }
   else