src: tweak the language detection confidence again.

Computing a "logical" number of sequences was a big mistake. In particular, a language with only positive sequences would get the same score (i.e. 1.0) as a language with a mix of positive and probable sequences. Instead, just use the real number of sequences, with probable sequences no longer bringing a full +1 to the numerator.

Also drop mTypicalPositiveRatio, at least for now. In my tests, it mostly made results worse. Maybe it would still make sense for languages with a huge number of characters (like CJK languages), for which we won't have the full list of characters in our "frequent" list of characters. Yet for most other languages, we actually list all the possible sequences within the character set, therefore any sequence outside our sequence list should necessarily drop confidence. Tweaking the result back up with some ratio is therefore counter-productive.

As for CJK cases, we'll see how to handle the much higher number of sequences (too many to list them all) when we get there.
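
To make the scoring problem concrete, here is a minimal standalone sketch of the old and new ratios as they appear in the diff. The SeqCounts struct and the OldConfidence/NewConfidence names are hypothetical and not part of the source tree. The sketch also assumes that mTotalSeqs equals the sum of the four category counters, and it leaves out the control-character adjustment that follows in GetConfidence().

```cpp
#include <cassert>

// Hypothetical stand-in for mSeqCounters[]; not the real class layout.
struct SeqCounts {
  int positive;
  int probable;
  int neutral;
  int negative;
};

// Old scoring: positive and negative counts are weighted by 4 in both the
// numerator and the "logical" total, then the result is divided by a
// per-model ratio (stand-in for mModel->mTypicalPositiveRatio).
float OldConfidence(const SeqCounts& c, float typicalPositiveRatio) {
  assert(typicalPositiveRatio > 0.0f);
  int positive = c.positive * 4;
  int probable = c.probable;
  int neutral  = c.neutral;
  int negative = c.negative * 4;
  int total    = positive + probable + neutral + negative;
  if (total == 0) return 0.0f;
  return 1.0f * (positive + probable - negative) / total / typicalPositiveRatio;
}

// New scoring: the denominator is the real number of sequences (assumed
// here to be the sum of the four counters). A probable sequence adds only
// a quarter point and a negative sequence costs two points.
float NewConfidence(const SeqCounts& c) {
  int total = c.positive + c.probable + c.neutral + c.negative;
  if (total == 0) return 0.0f;
  float positive = c.positive;
  float probable = c.probable;
  float negative = c.negative;
  return (positive + probable / 4 - negative * 2) / total;
}
```

With a ratio of 1.0, OldConfidence returns 1.0 both for {100, 0, 0, 0} and for {50, 50, 0, 0}, which is the tie described above. NewConfidence returns 1.0 for the first and 0.625 for the second, so the text made only of positive sequences now wins.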
2026-02-06 09:49:59 +08:00 · 2021-03-17 12:51:25 +01:00 · 2021-03-17 12:51:25 +01:00 · 714ae9ca29
commit 714ae9ca29
parent 26ed628061
1 changed file with 9 additions and 13 deletions
--- a/src/nsLanguageDetector.cpp
+++ b/src/nsLanguageDetector.cpp
@@ -116,21 +116,17 @@ float nsLanguageDetector::GetConfidence(void)
  float r;

  if (mTotalSeqs > 0) {
-    /* Create a "logical" number of sequences rather than real, but
-     * weighing the various sequences.
-     * Basically positive sequences will boost the confidence, probable
-     * sequence a bit, but not so much, neutral sequences will not be
-     * integrated in the confidence.
-     * Negative sequences will negatively impact the confidence as much
-     * as positive sequence positively impact it.
+    /* Positive sequences will boost the confidence, probable sequence
+     * only a bit but not so much, neutral sequences will stall the
+     * confidence.
+     * Negative sequences will negatively impact the confidence.
     */
-    int positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT] * 4;
-    int probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
-    int neutralSeqs  = mSeqCounters[LANG_NEUTRAL_CAT];
-    int negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT] * 4;
-    int totalSeqs    = positiveSeqs + probableSeqs + neutralSeqs + negativeSeqs;
+    float positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT];
+    float probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
+    //float neutralSeqs  = mSeqCounters[LANG_NEUTRAL_CAT];
+    float negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT];

-    r = ((float)1.0) * (positiveSeqs + probableSeqs - negativeSeqs) / totalSeqs / mModel->mTypicalPositiveRatio;
+    r = (positiveSeqs + probableSeqs / 4 - negativeSeqs * 2) / mTotalSeqs;
    /* The more control characters (proportionnaly to the size of the text), the
     * less confident we become in the current language.
     */