src: fix negative confidence wrapping around because of unsigned int.

In the extreme case where mCtrlChar exceeds mTotalChar (since the latter
does not include control characters), the subtraction should yield a
negative value, which as an unsigned int wraps around to a huge integer.
So a confidence that was bad enough to be negative ended up as a huge
confidence instead.
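
To illustrate the wraparound, here is a minimal sketch. It assumes the
counters are 32-bit unsigned integers, and UnsignedDifference is a
hypothetical helper, not a function from the code base:

```cpp
#include <cstdint>

// Hypothetical helper: the same subtraction as in the old expression,
// performed on unsigned operands (assumed 32-bit here). The result is
// computed modulo 2^32, so it can never be negative.
constexpr std::uint32_t UnsignedDifference(std::uint32_t total,
                                           std::uint32_t ctrl)
{
    return total - ctrl;
}

// 3 counted characters minus 5 control characters: instead of -2,
// the result wraps around to 2^32 - 2.
static_assert(UnsignedDifference(3u, 5u) == 4294967294u,
              "unsigned subtraction wraps around");
```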

We hit this case with our Japanese UTF-8 test file, which ended up
identified as French ISO-8859-1. So I just cast the uint to float early
on in order to avoid this pitfall.
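
The effect of the early cast can be sketched as follows. OldPenalty and
NewPenalty are hypothetical stand-ins for the two forms of the expression
in GetConfidence(), and the 32-bit counter type is an assumption:

```cpp
#include <cstdint>

// Old form: (total - ctrl) is evaluated in unsigned arithmetic first,
// and only the already wrapped result is converted to float.
constexpr float OldPenalty(float r, std::uint32_t total, std::uint32_t ctrl)
{
    return r * (total - ctrl) / total;
}

// New form: casting total to float first turns the subtraction into a
// float subtraction, so the result can go negative as intended.
constexpr float NewPenalty(float r, std::uint32_t total, std::uint32_t ctrl)
{
    return r * (static_cast<float>(total) - ctrl) / total;
}

// More control characters than counted characters:
// the old form explodes, the new form goes negative.
static_assert(OldPenalty(0.5f, 3u, 5u) > 1.0f, "old form yields a huge value");
static_assert(NewPenalty(0.5f, 3u, 5u) < 0.0f, "new form stays negative");
```

When mCtrlChar does not exceed mTotalChar, both forms should give the same
result, so only the degenerate case changes behavior.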

Now all our test cases succeed again, this time with full UTF-8+language
support! Wouhou!
Jehan 2021-03-20 23:02:10 +01:00
parent 4ef378ce2e
commit e6b4811c9b

@@ -130,7 +130,7 @@ float nsSingleByteCharSetProber::GetConfidence(int candidate)
 /* The more control characters (proportionnaly to the size of the text), the
  * less confident we become in the current charset.
  */
-r = r * (mTotalChar - mCtrlChar) / mTotalChar;
+r = r * ((float) mTotalChar - mCtrlChar) / mTotalChar;
 r = r*mFreqChar/mTotalChar;
 if (r >= (float)1.00)
   r = (float)0.99;