nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.

If all sequences in a text are positive sequences, the ratio of positive
sequences to total sequences cannot break a tie between two very close
charsets. A ratio of positive sequences per letter, on the other hand,
can break such a tie. If adding a letter does not increase the number
of positive sequences, the confidence decreases (reflecting the fact
that the new character was likely not a letter).
Conversely, if the number of positive sequences increases, so does the
confidence.
For instance, this fixes wrong detections between ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
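
The tie-breaking effect described above can be sketched in isolation. This is a minimal illustration, not the prober's actual code: the `Counters` struct and both functions are hypothetical names, the sample counts are invented, and returning 0 when there is nothing to measure is an assumption of this sketch. Only the arithmetic follows the hunk in this commit.

```cpp
#include <algorithm>

// Hypothetical stand-in for the prober's state. The comments name the
// members used in the hunk; this struct itself is not part of the library.
struct Counters {
    int positiveSeqs;            // mSeqCounters[POSITIVE_CAT]
    int totalSeqs;               // mTotalSeqs
    int totalChar;               // mTotalChar
    int freqChar;                // mFreqChar
    float typicalPositiveRatio;  // mModel->mTypicalPositiveRatio
};

// Ratio of positive sequences to total sequences only, left uncapped.
// Two candidates whose sequences are all positive get the same value.
float sequenceRatioOnly(const Counters& c) {
    if (c.totalSeqs <= 0) return 0.0f;  // assumption: nothing to measure
    return 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
}

// Same arithmetic as the hunk: the sequence ratio is additionally scaled
// by positive sequences per character, then by frequent characters per
// character, and capped at 0.99.
float confidence(const Counters& c) {
    if (c.totalSeqs <= 0 || c.totalChar <= 0) return 0.0f;  // assumption
    float r = sequenceRatioOnly(c);
    r = r * c.positiveSeqs / c.totalChar;
    r = r * c.freqChar / c.totalChar;
    return std::min(r, 0.99f);
}
```

With invented counts such as `{90, 90, 100, 95, 0.98f}` and `{88, 88, 100, 95, 0.98f}`, `sequenceRatioOnly` returns the same value for both candidates, while `confidence` ranks the first higher because it found more positive sequences for the same number of characters.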
Jehan 2015-11-30 19:52:07 +01:00
parent 9cb5764b73
commit 4f1c3ff85e


@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)
if (mTotalSeqs > 0) {
r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
/* Multiply by a ratio of positive sequences per characters.
* This would help in particular to distinguish close winners.
* Indeed if you add a letter, you'd expect the positive sequence count
* to increase as well. If it doesn't, it may mean that this new codepoint
* may not have been a letter, but instead a symbol (or some other
* character). This could make the difference between very closely related
* charsets used for the same language.
*/
r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
r = r*mFreqChar/mTotalChar;
if (r >= (float)1.00)
r = (float)0.99;