nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.

If every sequence in a text is a positive sequence, the ratio of positive sequences to total sequences cannot break a tie between 2 very close charsets. A ratio of positive sequences per letter, on the other hand, can break a tie between 2 encodings. If adding a letter does not increase the number of positive sequences, the confidence decreases (reflecting the fact that it was likely not a letter). If the number of positive sequences does increase, so does the confidence. For instance, this fixes wrong detections between ISO-8859-1 and ISO-8859-15: when letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.
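
A minimal standalone sketch of the tie-breaking effect described above. It is not the uchardet code: the `ProberStats` struct, the `legacyConfidence` and `patchedConfidence` helpers, and the 0.01 fallback are assumptions made for illustration, and the fields only loosely mirror `mSeqCounters[POSITIVE_CAT]`, `mTotalSeqs`, `mTotalChar`, `mFreqChar` and `mTypicalPositiveRatio`.

```cpp
// Hypothetical stand-in for the prober's counters (illustrative names only).
struct ProberStats {
    int positiveSeqs;            // sequences classified as positive
    int totalSeqs;               // all letter-to-letter sequences seen
    int totalChars;              // characters counted by the prober
    int freqChars;               // characters among the model's frequent ones
    float typicalPositiveRatio;  // per-model constant
};

// Confidence as computed before this commit (simplified).
float legacyConfidence(const ProberStats& s) {
    if (s.totalSeqs <= 0 || s.totalChars <= 0)
        return 0.01f;  // assumed fallback when nothing was counted
    float r = 1.0f * s.positiveSeqs / s.totalSeqs / s.typicalPositiveRatio;
    r = r * s.freqChars / s.totalChars;
    return r >= 1.0f ? 0.99f : r;
}

// Same computation with the extra "positive sequences per character" factor.
float patchedConfidence(const ProberStats& s) {
    if (s.totalSeqs <= 0 || s.totalChars <= 0)
        return 0.01f;  // assumed fallback when nothing was counted
    float r = 1.0f * s.positiveSeqs / s.totalSeqs / s.typicalPositiveRatio;
    r = r * s.positiveSeqs / s.totalChars;  // factor added by this commit
    r = r * s.freqChars / s.totalChars;
    return r >= 1.0f ? 0.99f : r;
}

// Made-up counts for one 100-character text seen by two close models.
// Model A treats one byte as a symbol, so the sequences around it are not
// counted at all; model B treats it as a letter and gains two positive ones.
const ProberStats kModelA = {90, 90, 100, 95, 0.95f};
const ProberStats kModelB = {92, 92, 100, 95, 0.95f};
```

With these invented counts, `legacyConfidence` returns the same value for `kModelA` and `kModelB`, since every counted sequence is positive in both, while `patchedConfidence` ranks `kModelB` higher because it found more positive sequences for the same number of characters.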
2026-02-28 03:18:03 +08:00 · 2015-11-30 19:52:07 +01:00 · 2015-11-30 19:52:07 +01:00 · 4f1c3ff85e
commit 4f1c3ff85e
parent 9cb5764b73
1 changed file with 9 additions and 0 deletions
--- a/src/nsSBCharSetProber.cpp
+++ b/src/nsSBCharSetProber.cpp
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)

  if (mTotalSeqs > 0) {
    r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
+    /* Multiply by a ratio of positive sequences per characters.
+     * This would help in particular to distinguish close winners.
+     * Indeed if you add a letter, you'd expect the positive sequence count
+     * to increase as well. If it doesn't, it may mean that this new codepoint
+     * may not have been a letter, but instead a symbol (or some other
+     * character). This could make the difference between very closely related
+     * charsets used for the same language.
+     */
+    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
    r = r*mFreqChar/mTotalChar;
    if (r >= (float)1.00)
      r = (float)0.99;