mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2026-01-01 03:12:24 +08:00
nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.
If all sequences in a text are positive sequences, the ratio of positive sequences to total sequences cannot break a tie between two very close charsets. A ratio of positive sequences per character, on the other hand, can. If adding a character does not increase the number of positive sequences, the confidence decreases (reflecting the fact that the character was likely not a letter). Conversely, if the number of positive sequences increases, so does the confidence. For instance, this fixes wrong detections between ISO-8859-1 and ISO-8859-15: when letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.
parent 9cb5764b73
commit 4f1c3ff85e
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)

  if (mTotalSeqs > 0) {
    r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
    /* Multiply by a ratio of positive sequences per characters.
     * This would help in particular to distinguish close winners.
     * Indeed if you add a letter, you'd expect the positive sequence count
     * to increase as well. If it doesn't, it may mean that this new codepoint
     * may not have been a letter, but instead a symbol (or some other
     * character). This could make the difference between very closely related
     * charsets used for the same language.
     */
    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
    r = r*mFreqChar/mTotalChar;
    if (r >= (float)1.00)
      r = (float)0.99;