Jehan 401eb55dfc src: improve algorithm for confidence computation.
In addition to the "frequent characters" concept, we add 2
sub-categories: the "very frequent characters" and the "rare
characters". The former are usually just a few characters which are used
most of the time (like 3 or 4 characters used 40% of the time!), whereas
the latter are often a dozen or more characters which, all together, are
used only a few percent of the time.

We use this additional concept to help distinguish very similar
languages, or languages whose frequent characters are a subset of
those of another language (typically English, whose alphabet is a
subset of those of many other European languages).

mTypicalPositiveRatio is removed, as it was barely of any use anyway
(it was 0.99-something for nearly all languages!). Instead we
get these 2 new ratios: veryFreqRatio and lowFreqRatio, and of course
the associated order counts to know which characters are in these sets.
2022-12-14 20:02:59 +01:00
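
To illustrate the idea in the commit message above, here is a minimal sketch of how observed shares of "very frequent" and "rare" characters could be compared with a model's expected ratios. Only veryFreqRatio and lowFreqRatio are names taken from the commit message; the struct, the other fields, both functions, and the scoring formula are assumptions made for illustration and are not the code in nsLanguageDetector.cpp.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical model description. Apart from veryFreqRatio and lowFreqRatio,
// every name here is invented for this sketch.
struct LangModelSketch {
    std::size_t freqCharCount;     // frequent characters have orders 0..freqCharCount-1
    std::size_t veryFreqCharCount; // the first orders are the "very frequent" set
    std::size_t lowFreqCharCount;  // the last frequent orders are the "rare" set
    double veryFreqRatio;          // expected share of very frequent characters
    double lowFreqRatio;           // expected share of rare characters
};

struct ObservedRatios {
    double veryFreq;
    double lowFreq;
};

// orders holds the frequency rank of each character seen in the text
// (0 = most frequent). Characters outside the frequent set are ignored.
ObservedRatios Observe(const LangModelSketch& m,
                       const std::vector<std::size_t>& orders)
{
    std::size_t total = 0, very = 0, low = 0;
    for (std::size_t order : orders) {
        if (order >= m.freqCharCount)
            continue;
        ++total;
        if (order < m.veryFreqCharCount)
            ++very;
        else if (order >= m.freqCharCount - m.lowFreqCharCount)
            ++low;
    }
    if (total == 0)
        return {0.0, 0.0};
    return {static_cast<double>(very) / total,
            static_cast<double>(low) / total};
}

// Assumed scoring rule: 1.0 when both observed shares match the model,
// decreasing linearly with the total deviation, clamped at 0.0.
double RatioScore(const LangModelSketch& m, const ObservedRatios& r)
{
    double deviation = std::fabs(r.veryFreq - m.veryFreqRatio)
                     + std::fabs(r.lowFreq - m.lowFreqRatio);
    double score = 1.0 - deviation;
    return score < 0.0 ? 0.0 : score;
}
```

Under this assumed rule, a text written in a language whose alphabet is a subset of the model's would tend to show too few rare characters, which lowers the score even though nearly every character belongs to the frequent set.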
..
LangModels script, src: rebuild the English model. 2022-12-14 00:36:02 +01:00
tools src: add a --language|-l option to the uchardet CLI tool. 2022-12-14 00:24:53 +01:00
Big5Freq.tab Initial release. 2011-07-10 15:04:42 +08:00
CharDistribution.cpp add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
CharDistribution.h src: build new charset prober for Johab Korean. 2022-12-14 00:23:13 +01:00
CMakeLists.txt script, src: add English language model. 2022-12-14 00:24:53 +01:00
EUCKRFreq.tab Initial release. 2011-07-10 15:04:42 +08:00
EUCTWFreq.tab Fix global-buffer-overflow due EUCTW_TABLE_SIZE 2020-04-22 17:06:40 +00:00
GB2312Freq.tab Initial release. 2011-07-10 15:04:42 +08:00
JISFreq.tab Initial release. 2011-07-10 15:04:42 +08:00
JohabFreq.tab src: build new charset prober for Johab Korean. 2022-12-14 00:23:13 +01:00
JpCntx.cpp Fixes boolean operation precedence warnings... 2015-11-18 19:38:12 +01:00
JpCntx.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsBig5Prober.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsBig5Prober.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsCharSetProber.cpp src: cast value to its proper type. 2017-08-27 13:01:30 +02:00
nsCharSetProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsCJKDetector.cpp src: new nsCJKDetector specifically Chinese/Japanese/Korean recognition. 2022-12-14 00:24:53 +01:00
nsCJKDetector.h src: new nsCJKDetector specifically Chinese/Japanese/Korean recognition. 2022-12-14 00:24:53 +01:00
nsCodingStateMachine.h add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
nscore.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsEscCharsetProber.cpp src: nsEscCharsetProber also returns the correct language. 2022-12-14 00:23:13 +01:00
nsEscCharsetProber.h src: nsEscCharsetProber also returns the correct language. 2022-12-14 00:23:13 +01:00
nsEscSM.cpp src: nsEscCharsetProber also returns the correct language. 2022-12-14 00:23:13 +01:00
nsEUCJPProber.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsEUCJPProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsEUCKRProber.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsEUCKRProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsEUCTWProber.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsEUCTWProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsGB2312Prober.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsGB2312Prober.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsHebrewProber.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsHebrewProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsJohabProber.cpp src, test: fix the new Johab prober and add a test. 2022-12-14 00:23:13 +01:00
nsJohabProber.h src, test: fix the new Johab prober and add a test. 2022-12-14 00:23:13 +01:00
nsLanguageDetector.cpp src: improve algorithm for confidence computation. 2022-12-14 20:02:59 +01:00
nsLanguageDetector.h src: improve algorithm for confidence computation. 2022-12-14 20:02:59 +01:00
nsLatin1Prober.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsLatin1Prober.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsMBCSGroupProber.cpp src: when checking for candidates, make sure we haven't any unprocessed… 2022-12-14 08:39:49 +01:00
nsMBCSGroupProber.h script, src: update Norwegian model with the new language features. 2022-12-14 00:24:53 +01:00
nsMBCSSM.cpp add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
nsPkgInt.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsSBCharSetProber.cpp src: improve confidence computation (generic and single-byte charset). 2022-12-14 00:24:53 +01:00
nsSBCharSetProber.h script, src: add English language model. 2022-12-14 00:24:53 +01:00
nsSBCSGroupProber.cpp script, src: rebuild the Danish model. 2022-12-14 00:24:53 +01:00
nsSBCSGroupProber.h script, src: rebuild the Danish model. 2022-12-14 00:24:53 +01:00
nsSJISProber.cpp src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsSJISProber.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
nsUniversalDetector.cpp src: reset shortcut charset/language on Reset(). 2022-12-14 00:24:53 +01:00
nsUniversalDetector.h src: nsEscCharsetProber also returns the correct language. 2022-12-14 00:23:13 +01:00
nsUTF8Prober.cpp src: drop less of UTF-8 confidence even with few non-multibyte chars. 2022-12-14 00:24:53 +01:00
nsUTF8Prober.h src: allow for nsCharSetProber to return several candidates. 2022-12-14 00:23:13 +01:00
prmem.h Initial release. 2011-07-10 15:04:42 +08:00
symbols.cmake src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. 2022-12-14 00:24:53 +01:00
uchardet.cpp src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. 2022-12-14 00:24:53 +01:00
uchardet.h src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. 2022-12-14 00:24:53 +01:00