src: drop the SURE_YES confidence for character distribution probers.

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-08 01:36:41 +08:00

Some probers are based on character distribution analysis. Though this
is still a relevant detection logic, we also know that it is a lot less
subtle than sequence distribution analysis.

Therefore let's give a good confidence to a text passing such analysis,
yet not a near-perfect one, thus leaving some chance for other probers.
In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, it is a very likely candidate.
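
As a rough illustration of the cap described above, here is a minimal,
simplified sketch of a character distribution confidence function. The
names `distribution_confidence`, `kSureYes`, `kSureNo` and the default
`data_threshold` are hypothetical and only loosely modeled on the code in
src/CharDistribution.cpp; this is not the actual uchardet implementation.

```cpp
// Hypothetical stand-ins for the SURE_YES / SURE_NO macros.
constexpr float kSureYes = 0.7f;   // cap after this commit (was 0.99f)
constexpr float kSureNo  = 0.01f;

// Simplified sketch (assumption): confidence grows with the ratio of
// frequent characters to the other characters, and is capped at kSureYes
// so that this coarse analysis never claims near certainty.
float distribution_confidence(int total_chars, int freq_chars,
                              float typical_ratio, int data_threshold = 4)
{
  // Too little data to say anything useful.
  if (total_chars <= 0 || freq_chars <= data_threshold)
    return kSureNo;

  if (total_chars != freq_chars) {
    float r = freq_chars / ((total_chars - freq_chars) * typical_ratio);
    if (r < kSureYes)
      return r;
  }
  // Never return more than the cap.
  return kSureYes;
}
```

With the cap at 0.7, a text that passes character distribution analysis
perfectly still scores below a candidate that gets, say, 0.86 from
sequence distribution analysis.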

I had the case with the Finnish UTF-8 test, which was passing (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was overridden
by UHC (EUC-KR). This used not to be a problem when nsMBCSGroupProber
would check the UTF-8 prober first and stop there with just some basic
encoding detection. Now that we go further and return all relevant
candidates, having some simpler detection algorithm which always returns
a too-good confidence is not the best idea.
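
The Finnish case can be sketched as a plain "highest confidence wins"
selection. This is only an illustration: `Candidate`, `pick_best` and
`finnish_case` are hypothetical names, the real candidate handling in
uchardet is more involved, and the UHC score is assumed here to be exactly
the SURE_YES cap. Only the 0.86 figure comes from the description above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical candidate record: a charset name and its confidence.
struct Candidate {
  std::string charset;
  float confidence;
};

// Returns the candidate with the highest confidence.
// The vector must not be empty.
Candidate pick_best(const std::vector<Candidate>& candidates)
{
  return *std::max_element(
      candidates.begin(), candidates.end(),
      [](const Candidate& a, const Candidate& b) {
        return a.confidence < b.confidence;
      });
}

// Builds the two competing candidates of the Finnish test, with UHC
// assumed to return the character distribution cap `sure_yes`.
std::vector<Candidate> finnish_case(float sure_yes)
{
  return { {"UTF-8", 0.86f}, {"UHC", sure_yes} };
}
```

Under these assumptions, `pick_best(finnish_case(0.99f))` selects UHC,
while `pick_best(finnish_case(0.7f))` selects UTF-8.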

Jehan

2021-03-17 21:32:49 +01:00

parent b00c85a6a6

commit a7c5a167a9

1 changed files with 1 additions and 1 deletions

									
src/CharDistribution.cpp
									
@@ -43,7 +43,7 @@
 #include "EUCTWFreq.tab"
 #include "GB2312Freq.tab"
-#define SURE_YES 0.99f
+#define SURE_YES 0.7f
 #define SURE_NO  0.01f
 //return confidence base on received data
