Mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git (synced 2025-12-12 06:30:05 +08:00)
src: drop less UTF-8 confidence even with few multibyte chars.
Some languages are not expected to contain multibyte characters. English, for instance, would typically have none. Yet you can still have UTF-8 English text (with a few special characters, or foreign words…), so let's make a low multibyte count less of a deal breaker.

To be fair, the whole logic is biased, of course, and I believe that eventually we should get rid of the lines of code that drop confidence based on a character count. It is a ridiculous rule: we base our whole logic on language statistics, and then suddenly we add some weird rule with a completely arbitrary number. But for now, I'll keep this as-is until we make the whole library more robust.
parent bffb7819d2
commit bed459c6e7
@@ -99,12 +99,13 @@ nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
 
 float nsUTF8Prober::GetConfidence(int candidate)
 {
-  float unlike = (float)0.99;
-
   if (mNumOfMBChar < 6)
   {
+    float unlike = 0.5f;
+
     for (PRUint32 i = 0; i < mNumOfMBChar; i++)
       unlike *= ONE_CHAR_PROB;
+
     return (float)1.0 - unlike;
   }
   else
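
To get a feel for what the change does numerically, here is a minimal standalone sketch of the small-count branch only (fewer than 6 multibyte characters). It is not uchardet code: the names `kOneCharProb`, `small_count_confidence` and `print_comparison` are hypothetical, and the value 0.5 for `ONE_CHAR_PROB` is an assumption, since its definition is not part of this hunk. The `else` branch is cut off in the hunk and is deliberately not modelled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// ASSUMPTION: stands in for ONE_CHAR_PROB, whose definition is not shown in
// the hunk. 0.5 is used here for illustration only.
static const float kOneCharProb = 0.5f;

// Hypothetical helper mirroring the body of the `mNumOfMBChar < 6` branch:
// start from an initial "unlikeliness" and multiply it by kOneCharProb once
// per multibyte character seen, then return the complement.
float small_count_confidence(float initial_unlike, unsigned num_mb_chars)
{
  float unlike = initial_unlike;

  for (unsigned i = 0; i < num_mb_chars; i++)
    unlike *= kOneCharProb;

  return 1.0f - unlike;
}

// Prints the confidence before (initial 0.99) and after (initial 0.5) the
// change for every multibyte count handled by the small-count branch.
void print_comparison()
{
  for (unsigned n = 0; n < 6; n++)
    std::printf("mb chars = %u  before = %.4f  after = %.4f\n", n,
                small_count_confidence(0.99f, n),
                small_count_confidence(0.5f, n));
}
```

Calling `print_comparison()` from a `main` shows the effect described in the commit message: under the stated assumption, text with no multibyte characters starts from a confidence of 0.5 rather than roughly 0.01, and the gap between the two variants narrows as more multibyte characters are seen.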