From 4f1c3ff85edaa78c841e976261407211295c4169 Mon Sep 17 00:00:00 2001
From: Jehan
Date: Mon, 30 Nov 2015 19:52:07 +0100
Subject: [PATCH] nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.

If all sequences in a text are positive sequences, the ratio of positive
sequences cannot make the difference between 2 very close charsets.
A ratio of positive sequences per letter, on the other hand, will break
a tie between 2 encodings.
If, while adding a letter, the number of positive sequences does not
increase, the confidence will decrease (corresponding to the fact that it
was likely not a letter). On the other hand, if the number of positive
sequences increases, so will the confidence.

For instance, this fixes wrong detections of ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
---
 src/nsSBCharSetProber.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/nsSBCharSetProber.cpp b/src/nsSBCharSetProber.cpp
index f333454..9c447e0 100644
--- a/src/nsSBCharSetProber.cpp
+++ b/src/nsSBCharSetProber.cpp
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)
 
   if (mTotalSeqs > 0) {
     r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
+    /* Multiply by a ratio of positive sequences per characters.
+     * This would help in particular to distinguish close winners.
+     * Indeed if you add a letter, you'd expect the positive sequence count
+     * to increase as well. If it doesn't, it may mean that this new codepoint
+     * may not have been a letter, but instead a symbol (or some other
+     * character). This could make the difference between very closely related
+     * charsets used for the same language.
+     */
+    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
     r = r*mFreqChar/mTotalChar;
     if (r >= (float)1.00)
       r = (float)0.99;
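
The tie-breaking effect described in the commit message can be illustrated with a small standalone sketch. This is not the uchardet code itself: the `Counters` struct, the function names `confidenceBefore` and `confidenceAfter`, the `0.01` fallback, and the extra guard on `totalChar` are all assumptions made for illustration. Only the three arithmetic steps mirror the lines shown in the hunk above.

```cpp
// Illustrative sketch only. Counters, confidenceBefore and confidenceAfter
// are hypothetical names; they are not part of the uchardet API.
struct Counters {
    int positiveSeqs;            // stands in for mSeqCounters[POSITIVE_CAT]
    int totalSeqs;               // stands in for mTotalSeqs
    int freqChar;                // stands in for mFreqChar
    int totalChar;               // stands in for mTotalChar
    float typicalPositiveRatio;  // stands in for mModel->mTypicalPositiveRatio
};

// Same cap as in the hunk: anything at or above 1.00 becomes 0.99.
static float capConfidence(float r) {
    return (r >= 1.00f) ? 0.99f : r;
}

// Confidence as computed by the context lines of the hunk (before the patch).
float confidenceBefore(const Counters& c) {
    // Assumed fallback value and assumed extra guard against division by zero.
    if (c.totalSeqs <= 0 || c.totalChar <= 0) return 0.01f;
    float r = 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
    r = r * c.freqChar / c.totalChar;
    return capConfidence(r);
}

// Confidence with the added factor: positive sequences per character.
float confidenceAfter(const Counters& c) {
    if (c.totalSeqs <= 0 || c.totalChar <= 0) return 0.01f;
    float r = 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
    r = r * c.positiveSeqs / c.totalChar;  // the line added by the patch
    r = r * c.freqChar / c.totalChar;
    return capConfidence(r);
}
```

With made-up counters where every counted sequence is positive for both candidates (say 90 of 90 for one charset and 88 of 88 for the other, over 100 characters each), `confidenceBefore` caps both at the same value, while `confidenceAfter` ranks the candidate with more positive sequences per character higher.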