mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-08 01:36:41 +08:00
src: drop the SURE_YES confidence for character distribution probers.
Some probers are based on character distribution analysis. Though this is still a relevant detection logic, we also know that it is a lot less subtle than sequence distribution. Therefore let's give a good confidence to a text passing such analysis, yet not a near-perfect one, thus leaving some chance for other probers.

In particular, we can definitely consider that if some text gets over 0.7 on sequence distribution analysis, it is a very likely candidate.

I had the case with the Finnish UTF-8 test, which was passing (UTF-8, Finnish) detection with a staggering 0.86 confidence, yet was overridden by UHC (EUC-KR). This used not to be a problem when nsMBCSGroupProber would check the UTF-8 prober first and stop there with just some basic encoding detection. Now that we go further and return all relevant candidates, a simpler detection algorithm which always returns too-good confidence is not the best idea.
parent b00c85a6a6
commit a7c5a167a9
@@ -43,7 +43,7 @@
 #include "EUCTWFreq.tab"
 #include "GB2312Freq.tab"
 
-#define SURE_YES 0.99f
+#define SURE_YES 0.7f
 #define SURE_NO 0.01f
 
 //return confidence base on received data
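
To illustrate the reasoning in the commit message, here is a minimal, self-contained sketch of how the cap on a distribution-based confidence changes which candidate wins. It is not uchardet's actual code: `kSureNo`, `DistributionConfidence` and `SequenceCandidateWins` are hypothetical names, and the raw score is assumed to be some unbounded ratio that gets clamped between the two constants.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical floor, mirroring the SURE_NO value shown in the diff.
constexpr float kSureNo = 0.01f;

// Clamp a raw character-distribution score into [kSureNo, sure_yes].
// `sure_yes` plays the role of SURE_YES (0.99f before, 0.7f after).
inline float DistributionConfidence(float raw_score, float sure_yes) {
  assert(sure_yes > kSureNo && sure_yes <= 1.0f);
  return std::min(std::max(raw_score, kSureNo), sure_yes);
}

// True when a sequence-based candidate outranks a distribution-based one.
inline bool SequenceCandidateWins(float sequence_conf,
                                  float distribution_conf) {
  return sequence_conf > distribution_conf;
}
```

Under these assumptions, a sequence-based candidate at 0.86 (the Finnish UTF-8 case) loses to a distribution-based candidate whose score saturates at a 0.99 cap, but wins once the cap is 0.7.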