From 4f1c3ff85edaa78c841e976261407211295c4169 Mon Sep 17 00:00:00 2001
From: Jehan <jehan@girinstud.io>
Date: Mon, 30 Nov 2015 19:52:07 +0100
Subject: [PATCH] nsSBCharSetProber: multiply confidence by ratio of positive
 seqs per chars.

If every sequence in a text is a positive sequence, the ratio of
positive sequences to total sequences is the same for two very close
charsets and cannot tell them apart.
A ratio of positive sequences per character, on the other hand, can
break a tie between two encodings. If adding a character does not
increase the number of positive sequences, the confidence decreases
(reflecting that the character was likely not a letter).
Conversely, if the number of positive sequences increases, so does the
confidence.
For instance, this fixes misdetections between ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
---
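
Note (not part of the commit message): the effect of the extra factor
can be sketched outside the prober. This is only an illustration under
assumptions: the Counters struct and the confidence() helper are
hypothetical stand-ins for the prober's members, and the 0.01 fallback
for empty input is chosen here for the example.

```cpp
// Hypothetical stand-in for the prober state; field names are illustrative.
struct Counters {
    unsigned positiveSeqs;       // role of mSeqCounters[POSITIVE_CAT]
    unsigned totalSeqs;          // role of mTotalSeqs
    unsigned totalChars;         // role of mTotalChar
    unsigned freqChars;          // role of mFreqChar
    float typicalPositiveRatio;  // role of mModel->mTypicalPositiveRatio
};

// Sketch of the confidence computation with the new factor applied.
float confidence(const Counters& c) {
    // Assumed fallback for empty input; also avoids dividing by zero.
    if (c.totalSeqs == 0 || c.totalChars == 0)
        return 0.01f;

    // Share of positive sequences, normalized by the model's typical ratio.
    float r = 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
    // New factor: positive sequences per character.
    r = r * c.positiveSeqs / c.totalChars;
    // Existing factor: share of frequent characters.
    r = r * c.freqChars / c.totalChars;

    // Cap the result just below certainty.
    if (r >= 1.0f)
        r = 0.99f;
    return r;
}
```

With two candidates that both see only positive sequences (for example
90 of 90 versus 88 of 88 over the same 100 characters), the first factor
is identical, so only the per-character factor separates them, in favor
of the candidate that recognized more letters.
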
 src/nsSBCharSetProber.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/nsSBCharSetProber.cpp b/src/nsSBCharSetProber.cpp
index f333454..9c447e0 100644
--- a/src/nsSBCharSetProber.cpp
+++ b/src/nsSBCharSetProber.cpp
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)
 
   if (mTotalSeqs > 0) {
     r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
+    /* Multiply by the ratio of positive sequences per character.
+     * This helps in particular to distinguish close winners.
+     * When a letter is added, the positive sequence count is expected
+     * to increase as well. If it does not, the new codepoint was
+     * probably not a letter but a symbol (or some other character).
+     * This can make the difference between very closely related
+     * charsets used for the same language.
+     */
+    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
     r = r*mFreqChar/mTotalChar;
     if (r >= (float)1.00)
       r = (float)0.99;