src: drop less UTF-8 confidence even with few multibyte chars.

Some languages are not meant to have multibyte characters. For instance,
English would typically have none. Yet you can still have UTF-8 English
text (with a few special characters, or foreign words…). So let's make a
low multibyte character count less of a deal breaker.

To be even fairer, the whole logic is biased of course, and I believe
that eventually we should get rid of these lines of code dropping
confidence based on a character count. This is a ridiculous rule (we base
our whole logic on language statistics and suddenly we add some
weird rule with a completely random number). But for now, I'll keep this
as-is until we make the whole library even more robust.
Jehan 2021-05-23 17:04:37 +02:00
parent bffb7819d2
commit bed459c6e7


@@ -99,12 +99,13 @@ nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
 float nsUTF8Prober::GetConfidence(int candidate)
 {
-  float unlike = (float)0.99;
   if (mNumOfMBChar < 6)
   {
+    float unlike = 0.5f;
     for (PRUint32 i = 0; i < mNumOfMBChar; i++)
       unlike *= ONE_CHAR_PROB;
     return (float)1.0 - unlike;
   }
   else
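
To make the effect of the change on the `mNumOfMBChar < 6` branch concrete, here is a minimal standalone sketch of that branch, not the library code itself. It assumes `ONE_CHAR_PROB` is 0.5, since its definition is not part of this diff. `kOneCharProb` and `LowCountConfidence` are hypothetical names introduced only for illustration.

```cpp
// Assumption: ONE_CHAR_PROB is 0.5; its definition is not shown in this diff.
static const float kOneCharProb = 0.5f;

// Hypothetical helper mirroring the `mNumOfMBChar < 6` branch of
// nsUTF8Prober::GetConfidence(). `initialUnlike` is the starting value of
// `unlike`: 0.99 before this commit, 0.5 after it.
float LowCountConfidence(float initialUnlike, unsigned numMBChars)
{
  float unlike = initialUnlike;
  // Each multibyte character seen halves the "unlikeliness" of UTF-8.
  for (unsigned i = 0; i < numMBChars; i++)
    unlike *= kOneCharProb;
  return 1.0f - unlike;
}
```

Under that assumption, a text with no multibyte character at all goes from a confidence of about 0.01 (`LowCountConfidence(0.99f, 0)`) to 0.5 (`LowCountConfidence(0.5f, 0)`), and a text with two multibyte characters goes from about 0.7525 to 0.875. The gap shrinks as the count approaches 6, where the other branch takes over.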