Jehan 5257fc1abf Using the generic language detector in UTF-8 detection.
The UTF-8 prober now not only detects valid UTF-8 but also detects the
most probable language. Using the data generated two commits back, this
works very well.

This is still basic and will need further improvement. In particular,
nsUTF8Prober should now return an array of candidate ("UTF-8",
language) pairs, and nsMBCSGroupProber should in turn forward these
candidates along with those from the other multi-byte detectors. This
way, the public-facing API would receive more probable candidates to
choose from, in case the algorithm is slightly wrong.
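
As a rough sketch of the candidate forwarding described above (not the
actual uchardet code): the Candidate struct and the MergeCandidates()
helper are hypothetical names made up for illustration. The idea is
only that a group prober concatenates the candidate lists of its
sub-probers and orders them from most to least probable.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical candidate type; not the real uchardet structure.
struct Candidate {
    std::string encoding;
    std::string language;   // empty when no language was detected
    float       confidence; // expected in [0, 1]
};

// Hypothetical helper: merge the candidate lists of several probers
// and sort them by decreasing confidence. stable_sort keeps the
// original order for candidates with equal confidence.
std::vector<Candidate> MergeCandidates(
    const std::vector<std::vector<Candidate>>& perProber)
{
    std::vector<Candidate> merged;
    for (const auto& list : perProber)
        merged.insert(merged.end(), list.begin(), list.end());

    std::stable_sort(merged.begin(), merged.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.confidence > b.confidence;
                     });
    return merged;
}
```

With such a list, a caller could still fall back on the second or
third entry when the top candidate turns out to be wrong.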

Also, the UTF-8 confidence is currently stupidly high as soon as we
consider the input to be valid UTF-8. We should probably weigh it by
the language detection result. In particular, if no language is
detected, this should severely weigh down UTF-8 detection: not to 0,
but to a value high enough to serve as a fallback when no other
encoding+language pair is valid, and low enough to give other good
candidate pairs a chance.
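
One possible shape for this weighing, as a sketch only: the function
name WeighUtf8Confidence() and both constants below are assumptions
picked for illustration, not values from the code. Valid UTF-8 with no
detected language gets a mid-range fallback score, and a detected
language raises the score toward the maximum.

```cpp
#include <algorithm>

// Hypothetical constants, chosen only for illustration.
constexpr float kMaxUtf8Confidence  = 0.99f;
constexpr float kNoLanguageFallback = 0.50f;

// Hypothetical helper: combine UTF-8 validity with the confidence of
// the language detector (languageConfidence is ignored when
// languageDetected is false).
float WeighUtf8Confidence(bool validUtf8,
                          bool languageDetected,
                          float languageConfidence)
{
    if (!validUtf8)
        return 0.0f;                 // invalid byte sequence: rule out UTF-8
    if (!languageDetected)
        return kNoLanguageFallback;  // weighed down, but never to 0

    // Clamp to [0, 1], then interpolate between fallback and maximum.
    float lang = std::min(1.0f, std::max(0.0f, languageConfidence));
    return kNoLanguageFallback
         + (kMaxUtf8Confidence - kNoLanguageFallback) * lang;
}
```

A linear interpolation is just the simplest choice here; any monotonic
mapping that keeps the no-language case above 0 and below the
well-supported candidates would fit the description above.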
2022-12-14 00:23:13 +01:00
LangModels Rebuild a bunch of language models. 2022-12-14 00:23:13 +01:00
tools src: add a --weight option to the CLI tool. 2022-12-14 00:23:13 +01:00
Big5Freq.tab Initial release. 2011-07-10 15:04:42 +08:00
CharDistribution.cpp Update code from upstream. 2011-07-11 14:42:50 +08:00
CharDistribution.h uchardet_get_charset() must return iconv-compatible names. 2015-11-17 16:15:21 +01:00
CMakeLists.txt New generic language detector class. 2022-12-14 00:23:13 +01:00
EUCKRFreq.tab Initial release. 2011-07-10 15:04:42 +08:00
EUCTWFreq.tab Fix global-buffer-overflow due EUCTW_TABLE_SIZE 2020-04-22 17:06:40 +00:00
GB2312Freq.tab Initial release. 2011-07-10 15:04:42 +08:00
JISFreq.tab Initial release. 2011-07-10 15:04:42 +08:00
JpCntx.cpp Fixes boolean operation precedence warnings... 2015-11-18 19:38:12 +01:00
JpCntx.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsBig5Prober.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsBig5Prober.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsCharSetProber.cpp src: cast value to its proper type. 2017-08-27 13:01:30 +02:00
nsCharSetProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsCodingStateMachine.h Bug 101032 - assignments to nsSMState in nsCodingStateMachine result... 2017-05-28 20:01:06 +02:00
nscore.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsEscCharsetProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEscCharsetProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEscSM.cpp Bug 101030 - Buffer overflow related to ISO2022JP detection in... 2017-05-14 19:49:01 +02:00
nsEUCJPProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEUCJPProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEUCKRProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEUCKRProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEUCTWProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsEUCTWProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsGB2312Prober.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsGB2312Prober.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsHebrewProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsHebrewProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsLanguageDetector.cpp New generic language detector class. 2022-12-14 00:23:13 +01:00
nsLanguageDetector.h New generic language detector class. 2022-12-14 00:23:13 +01:00
nsLatin1Prober.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsLatin1Prober.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsMBCSGroupProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsMBCSGroupProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsMBCSSM.cpp uchardet_get_charset() must return iconv-compatible names. 2015-11-17 16:15:21 +01:00
nsPkgInt.h Update code from upstream. 2011-07-11 14:42:50 +08:00
nsSBCharSetProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsSBCharSetProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsSBCSGroupProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsSBCSGroupProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsSJISProber.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsSJISProber.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsUniversalDetector.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsUniversalDetector.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsUTF8Prober.cpp Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
nsUTF8Prober.h Using the generic language detector in UTF-8 detection. 2022-12-14 00:23:13 +01:00
prmem.h Initial release. 2011-07-10 15:04:42 +08:00
symbols.cmake src: new weight concept in the C API. 2022-12-14 00:23:13 +01:00
uchardet.cpp src: new weight concept in the C API. 2022-12-14 00:23:13 +01:00
uchardet.h src: new weight concept in the C API. 2022-12-14 00:23:13 +01:00