373 Commits

Author SHA1 Message Date
Jehan
b7acffc806 script, src: remove generated statistics data for Korean. 2022-12-14 00:24:53 +01:00
Jehan
b725c0b2ff src: new nsCJKDetector specifically for Chinese/Japanese/Korean recognition.
I was pondering improving the logic of the LanguageModel contents, in
order to better handle languages with a huge number of characters (far
too many to keep a full frequent list while keeping reasonable memory
consumption and speed).
But then I realized that this happens for languages which have their
own set of characters anyway.

For instance, modern Korean is almost entirely hangul. Of course, we can
find some Chinese characters here and there, but nothing which should
really break confidence if we base it on the hangul ratio. If some day
we want to go further and detect older Korean, we will have to improve
the logic a bit with some statistics, though I wonder whether limiting
ourselves to character frequency would be enough here (sequence
frequency may be overkill). To be tested.
In any case, this new class gives much more relevant confidence on
Korean texts, compared to the statistics data we previously generated.

For Japanese, it is a mix of kana and Chinese characters. A modern full
text cannot exist without a lot of kana (probably only old texts or very
short texts, such as titles, could have only Chinese characters). We
would still want to add a bit of statistics to correctly differentiate a
Japanese text with a lot of Chinese characters in it from a Chinese
text which quotes a few Japanese phrases. It will have to be
improved, but for now it works fairly well.

A last case where we would want to play with statistics might be if we
want to differentiate between regional variants. For instance,
Simplified Chinese, Taiwan or Hong Kong Chinese… More to experiment
with later on. It's already a good first step for language-aware UTF-8
support!
2022-12-14 00:24:53 +01:00
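As a rough illustration of the script-ratio idea described in b725c0b2ff, here is a minimal Python sketch. It is not the nsCJKDetector code: the function names `classify` and `script_ratios` are hypothetical, the Unicode ranges only cover the main hangul, kana and basic CJK ideograph blocks, and how the real class turns such ratios into a confidence is not shown in the commit message.

```python
def classify(cp: int) -> str:
    """Return a coarse script class for one Unicode code point (illustrative ranges only)."""
    # Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo.
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    # Hiragana and Katakana blocks.
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    # CJK Unified Ideographs (basic block only; extensions are ignored here).
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def script_ratios(text: str) -> dict:
    """Share of each script class among the letters of `text` (non-letters are skipped)."""
    counts = {"hangul": 0, "kana": 0, "han": 0, "other": 0}
    letters = 0
    for ch in text:
        if not ch.isalpha():
            continue  # punctuation, digits and spaces are neutral
        counts[classify(ord(ch))] += 1
        letters += 1
    if letters == 0:
        return {name: 0.0 for name in counts}
    return {name: n / letters for name, n in counts.items()}
```

A text with a high "hangul" share would point to Korean, while a noticeable "kana" share next to "han" would point to Japanese rather than Chinese.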
Jehan
c782177a8d README: fix a duplicate. 2022-12-14 00:24:53 +01:00
Jehan
3ca49e2bc1 Update README. 2022-12-14 00:24:50 +01:00
Jehan
8113f604de src: consider any combination with a non-frequent character as sequence.
Basically since we exclude non-letters (control characters, punctuation,
spaces, separators, emoticons and whatnot), we consider any remaining
character as an off-script letter (we may have forgotten some cases, but
so far, it looks promising). Hence it is normal to consider a
combination with these (i.e. 2 off-script letters or 1 frequent letter +
1 off-script, in any order) as a sequence too. Doing so will drop the
confidence even more for any text having too many of these. As a
consequence, it again widens the gap between the first and second
contender, which seems to really show it works.
2022-12-14 00:23:13 +01:00
Jehan
a1b186fa8b src: add Hindi/UTF-8 support. 2022-12-14 00:23:13 +01:00
Jehan
9736950227 src: improve confidence computation.
Detect various blocks of characters for punctuation, symbols, emoticons
and whatnot. These are considered more or less neutral in the confidence
(because it's normal to have punctuation, and many texts nowadays are
expected to contain emoticons or various symbols).
What is of interest is all the rest, which will then be considered as
out-of-range characters (likely characters from other scripts) and will
therefore drop the confidence.

Now confidence will therefore take into account the ratio of all
in-range characters (script letters + various neutral characters) and
the ratio of frequent letters within all letters (script letters +
out-of-range characters).
This improved algorithm makes for much more effective detection, as it
bumped most confidence values in all our unit tests, and usually
increased the gap between the first and second contender.
2022-12-14 00:23:13 +01:00
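The two ratios named in 9736950227 can be written down as a small Python sketch. The function name `letter_ratios` and its parameters are hypothetical, and the way the real code combines these ratios into the final confidence is not stated in the commit message, so the sketch only computes them.

```python
def letter_ratios(n_frequent: int, n_script_letters: int,
                  n_neutral: int, n_out_of_range: int) -> tuple:
    """Return (in_range_ratio, frequent_ratio) from raw character counts.

    n_frequent       -- frequent letters of the language (subset of script letters)
    n_script_letters -- all letters belonging to the language's script
    n_neutral        -- punctuation, symbols, emoticons and the like
    n_out_of_range   -- remaining characters, treated as letters of other scripts
    """
    total = n_script_letters + n_neutral + n_out_of_range
    letters = n_script_letters + n_out_of_range
    # Share of characters that are acceptable for this script at all.
    in_range_ratio = (n_script_letters + n_neutral) / total if total else 0.0
    # Share of frequent letters among everything counted as a letter.
    frequent_ratio = n_frequent / letters if letters else 0.0
    return in_range_ratio, frequent_ratio
```

With these definitions, neutral characters never lower either ratio on their own, while every out-of-range character lowers both.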
Jehan
a98cdcd88f script: fix BuildLangModel.py a bit when use_ascii is True.
In particular, I prepare the case for English detection. I am not
pushing actual English models yet, because detection is not good enough
yet. I will do so when I am able to better handle English confidence.
Jehan
629bc879f3 script, src: add generic Korean model.
Until now, Korean charsets had their own probers as there is no
single-byte encoding for writing Korean. I have now added a Korean model
only for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language script has a lot of
characters (so we can't have a full sequence combination array, just too
much data). It's not perfect yet. For instance our UTF-8 Korean test
file ends up with confidence of 0.38503, which is low for obvious Korean
text. Still it works (correctly detected, with top confidence compared
to others) and is a first step toward more improvement for detection
confidence.
2022-12-14 00:23:13 +01:00
Jehan
0d152ff430 src, test: fix the new Johab prober and add a test.
This prober comes from MR !1 on the main branch, though it was too
aggressive then and could not get merged. On the improved API branch, it
no longer detects other tests as Johab.

Also fixing it to work with the new API.

Finally adding a Johab/ko unit test.
2022-12-14 00:23:13 +01:00
Jehan
3996b9d648 src: build new charset prober for Johab Korean.
CMake build was not completed and enum state nsSMState disappeared in
commit 53f7ad0.
Also fixing a few coding style bugs.

See discussion in MR !1.
2022-12-14 00:23:13 +01:00
LSY
d72a5c88ce add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
Jehan
ded948ce15 script, src: generate the Hebrew models.
The Hebrew Model had never been regenerated by my scripts. I now added
the base generation files.

Note that I added 2 charsets: ISO-8859-8 and WINDOWS-1255 but they are
nearly identical. One of the differences is that the generic currency
sign is replaced by the sheqel sign (Israel currency) in Windows-1255.
And though this one lost the "double low line", apparently some Yiddish
characters were added. Basically it looks like most Hebrew text would
work fine with the same confidence on both charsets and detecting both
is likely irrelevant. So I keep the charset file for ISO-8859-8, but
won't actually use it.

The good part is that Hebrew is now also recognized in UTF-8 text thanks
to the new code and newly generated language model.
2022-12-14 00:23:13 +01:00
Jehan
cf0ffb0c55 test: 4 new tests for UTF-8.
Taken from random pages for each of these languages.
I now have a test for each of the 26 supported (UTF-8, language)
couples. These all work fine and are detected with the right encoding
and language.
2022-12-14 00:23:13 +01:00
Jehan
a7c5a167a9 src: drop the SURE_YES confidence for character distribution probers.
Some probers are based on character distribution analysis. Though it is
still relevant detection logic, we also know that it is a lot less
subtle than sequence distribution.

Therefore let's give a good confidence for a text passing such analysis,
yet not a near perfect one, thus leaving some chance for other probers.
In particular, we can definitely consider that if some text gets over
0.7 on sequence distribution analysis, this is a very likely candidate.

I had the case with the Finnish UTF-8 test which was passing (UTF-8,
Finnish) detection with a staggering 0.86 confidence, yet was overridden
by UHC (EUC-KR). This used to not be a problem when nsMBCSGroupProber
would check the UTF-8 prober first and stop there with just some basic
encoding detection. Now that we go further and return all relevant
candidates, having simpler detection algorithms which always return
too-good confidence is not the best idea.
2022-12-14 00:23:13 +01:00
Jehan
b00c85a6a6 src: do not shortcut UTF-8 detection too early.
I had the case with the Czech test which was considered as Irish after
being shortcut far too early, after only 16 characters. Confidence
values were just barely above 0.5 for Irish (and barely below for Czech).

By adding a threshold (at least 256 characters), we give a bit of
relevant data to the engine to actually make an informed decision. By
then, the Czech detection was at more than 0.7, whereas the Irish one at
0.6.
2022-12-14 00:23:13 +01:00
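The guard described in b00c85a6a6 amounts to a two-part condition, sketched below in Python. The 256-character minimum comes from the commit message; the function name `may_shortcut` and the 0.5 confidence threshold are assumptions made for the example (the message only says Irish was "barely above 0.5" when the shortcut fired).

```python
MIN_CHARS_BEFORE_SHORTCUT = 256  # from the commit message
SHORTCUT_CONFIDENCE = 0.5        # assumed threshold, for illustration only


def may_shortcut(chars_seen: int, best_confidence: float,
                 min_chars: int = MIN_CHARS_BEFORE_SHORTCUT,
                 threshold: float = SHORTCUT_CONFIDENCE) -> bool:
    """Allow an early decision only once enough characters were analyzed."""
    return chars_seen >= min_chars and best_confidence > threshold
```

With only 16 characters seen, `may_shortcut(16, 0.51)` is false, so detection keeps feeding data to every language model instead of settling on the first one to cross the threshold.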
Jehan
2a16ab2310 src: nsEscCharsetProber also returns the correct language.
nsEscCharsetProber will still only return a single candidate, because
this is detected by a state machine, not by language statistics.
It will now also return the language attached to the encoding.
2022-12-14 00:23:13 +01:00
Jehan
6138d9e0f0 src: make nsMBCSGroupProber report all valid candidates.
Returning only the best one has limits, as it doesn't allow checking
candidates with very close confidence. Now in particular, the UTF-8
prober will return all ("UTF-8", lang) candidates for every language
with probable statistical fit.
2022-12-14 00:23:13 +01:00
Jehan
2127f4fc0d src: allow for nsCharSetProber to return several candidates.
No functional change yet because all probers still return 1 candidate.
Yet now we add a GetCandidates() method to return a number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
couple (charset, language) and that the confidence is for this specific
couple, not just the confidence for charset detection.
2022-12-14 00:23:13 +01:00
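The indexed-candidate interface described in 2127f4fc0d can be pictured with the following Python sketch. It only mirrors the shape of the C++ methods named in the commit message (GetCandidates(), GetCharSetName(), GetLanguage(), GetConfidence()); the `Candidate` and `Prober` classes and their snake_case method names are hypothetical stand-ins, not the actual nsCharSetProber code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    charset: str
    language: Optional[str]  # None when no language is known
    confidence: float        # confidence for the (charset, language) couple


class Prober:
    """Toy prober exposing several (charset, language) candidates by index."""

    def __init__(self, candidates: List[Candidate]):
        self._candidates = list(candidates)

    def get_candidates(self) -> int:
        # Number of candidates; valid indexes are 0 .. get_candidates() - 1.
        return len(self._candidates)

    def _at(self, index: int) -> Candidate:
        if not 0 <= index < len(self._candidates):
            raise IndexError("candidate index out of range")
        return self._candidates[index]

    def get_charset_name(self, index: int) -> str:
        return self._at(index).charset

    def get_language(self, index: int) -> Optional[str]:
        return self._at(index).language

    def get_confidence(self, index: int) -> float:
        return self._at(index).confidence
```

A caller would loop over `range(prober.get_candidates())` and read the three getters for each index, which is what lets a group prober forward every couple instead of only the best one.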
Jehan
ea32980273 src: nsMBCSGroupProber confidence weighed by language confidence.
Since our whole charset detection logic is based on text having meaning
(using actual language statistics), just because a text is valid UTF-8
does not mean it is absolutely the right encoding. It may also fit other
encodings with maybe very high statistical confidence (and therefore a
better candidate).
Therefore instead of just returning 0.99 or other high values, let's
weigh our encoding confidence with the best language confidence.
2022-12-14 00:23:13 +01:00
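One way to read the weighting described in ea32980273 is a simple product, sketched here in Python. This is an assumption: the commit message says the encoding confidence is weighed "with the best language confidence" without giving the formula, and the function name `weighted_confidence` is hypothetical.

```python
def weighted_confidence(encoding_confidence: float,
                        language_confidences: dict) -> float:
    """Scale an encoding confidence by the best language confidence.

    `language_confidences` maps a language code to the confidence of the
    language detector for the decoded text. Multiplying is an assumed
    combination rule, chosen for illustration.
    """
    best_language = max(language_confidences.values(), default=0.0)
    return encoding_confidence * best_language
```

Under this reading, valid UTF-8 (say 0.99) whose best language only reaches 0.5 ends up near 0.495, leaving room for another encoding with a stronger statistical fit.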
Jehan
25d2890676 src: tweak again the language detection confidence.
Computing a logical number of sequences was a big mistake. In particular,
a language with only positive sequences would have the same score as a
language with a mix of only positive and probable sequences (i.e. 1.0).
Instead, just use the real number of sequences, but probable sequences
don't bring +1 to the numerator.

Also drop the mTypicalPositiveRatio, at least for now. In my tests, it
mostly made results worse. Maybe this would still make sense for
languages with a huge number of characters (like CJK languages), for
which we won't have the full list of characters in our "frequent" list
of characters. Yet for most other languages, we actually list all the
possible sequences within the character set, therefore any sequence out
of our sequence list should necessarily drop confidence. Tweaking the
result back up with some ratio is therefore counter-productive.

As for CJK cases, we'll see how to handle the much higher number of
sequences (too many to list them all) when we get there.
2022-12-14 00:23:13 +01:00
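The scoring change in 25d2890676 can be illustrated with a short Python sketch. The denominator is the real number of sequences, as the commit message says; the function name `sequence_ratio` and the `probable_weight` parameter (with its 0.25 default) are assumptions, since the message only states that probable sequences contribute less than +1.

```python
def sequence_ratio(positive: int, probable: int, total: int,
                   probable_weight: float = 0.25) -> float:
    """Score sequences against the real total number of sequences.

    positive sequences count fully; probable sequences count for
    `probable_weight` (an illustrative value below 1); everything else
    in `total` contributes nothing to the numerator.
    """
    if total <= 0:
        return 0.0
    return (positive + probable_weight * probable) / total
```

With this rule, a text made only of positive sequences scores 1.0, while a half-positive, half-probable text scores lower (0.625 with the illustrative weight), which is the distinction the previous computation lost.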
Jehan
1b5e68be00 test: update unit test to check detected languages.
Excepting ASCII, UTF-16 and UTF-32 for which we don't detect languages
yet.
2022-12-14 00:23:13 +01:00
Jehan
82c1d2b25e src: reset language detectors when resetting a nsMBCSGroupProber. 2022-12-14 00:23:13 +01:00
Jehan
eb8308d50a src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported models which had single-byte encoding support until
now.
2022-12-14 00:23:13 +01:00
Jehan
5257fc1abf Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober not only detects valid UTF-8, but also detects the
most probable language. Using the data generated 2 commits ago, this
works very well.

This is still basic and will require even more improvements. In
particular, now the nsUTF8Prober should return an array of ("UTF-8",
language) couple candidates. And nsMBCSGroupProber should itself forward
these candidates as well as other candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid and low enough to give
chances to other good candidate couples).
2022-12-14 00:23:13 +01:00
Jehan
dac7cbd30f New generic language detector class.
It detects languages similarly to the single-byte encoding detector
algorithm, based on character frequency and sequence frequency, except
it does it generically from Unicode code points, not caring at all about
the original encoding.

The confidence algorithm for language is very similar to the confidence
algorithm for encoding+language in nsSBCharSetProber, though I tweaked
it a little, making it more trustworthy. And I plan to tweak it even a
bit more later, as I progressively improve the detection logic with
some of the ideas I had.
2022-12-14 00:23:13 +01:00
Jehan
b70b1ebf88 Rebuild a bunch of language models.
Adding generic language models (see coming commit), which use the same
data as the specific single-byte encoding statistics models, except that
they apply it to Unicode code points.
For this to work, instead of the CharToOrderMap which was mapping
directly from encoded byte (always 256 values) to order, now we add an
array of frequent characters, ordered by Unicode code point, which maps
each character to its order of frequency (which can be used on the same
sequence mapping array).

This of course means that each prober where we will want to use these
generic models will have to implement its own byte to code point
decoder, as this is per-encoding logic anyway. This will come in a
subsequent commit.
2022-12-14 00:23:13 +01:00
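A lookup over an array sorted by code point, as described in b70b1ebf88, could look like the Python sketch below. The table contents are made up for the example, and the names `FREQUENT_CHARS` and `order_of` are hypothetical; the generated C++ models use their own layout.

```python
import bisect

# (code point, frequency order) pairs, sorted by code point.
# Values are invented for illustration; real tables come from BuildLangModel.py.
FREQUENT_CHARS = [
    (0x0061, 2),  # 'a'
    (0x0065, 0),  # 'e'
    (0x0073, 1),  # 's'
    (0x00E9, 3),  # 'é'
]
_CODE_POINTS = [cp for cp, _ in FREQUENT_CHARS]


def order_of(code_point: int):
    """Return the frequency order of a code point, or None if it is not frequent."""
    i = bisect.bisect_left(_CODE_POINTS, code_point)
    if i < len(_CODE_POINTS) and _CODE_POINTS[i] == code_point:
        return FREQUENT_CHARS[i][1]
    return None
```

Because the array is sorted, the lookup is a binary search, so the table can stay small even though Unicode has far more than the 256 values a byte-indexed CharToOrderMap had to cover.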
Jehan
a0bfba3db3 src: add a --weight option to the CLI tool.
Syntax is: lang1:weight1,lang2:weight2…
For instance: `uchardet -wfr:1.1,it:1.05 file.txt` if you think a file
is probably French or maybe Italian.
2022-12-14 00:23:13 +01:00
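The `lang1:weight1,lang2:weight2` syntax from a0bfba3db3 is simple enough to parse in a few lines. The Python sketch below is only an illustration of the format; the function name `parse_weights` is hypothetical and the actual CLI tool is written in C++.

```python
def parse_weights(spec: str) -> dict:
    """Parse 'fr:1.1,it:1.05' into {'fr': 1.1, 'it': 1.05}."""
    weights = {}
    for item in spec.split(","):
        lang, sep, value = item.partition(":")
        lang = lang.strip()
        if not sep or not lang:
            raise ValueError(f"expected lang:weight, got {item!r}")
        weights[lang] = float(value)  # raises ValueError on a non-numeric weight
    return weights
```

For the example in the commit message, `parse_weights("fr:1.1,it:1.05")` yields a weight of 1.1 for French and 1.05 for Italian.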
Jehan
669ede73a3 src: new weight concept in the C API.
Pretty basic, you can weight preferred languages and this will impact the
result. Say the algorithm "hesitates" between encoding E1 in language L1
and encoding E2 in language L2. By setting L2 with a 1.1 weight, for
instance because this is the OS language, or usual preferred language,
you may help the algorithm to overcome very tight cases.

It can also be helpful when you already know for sure the language of a
document, you just don't know its encoding. Then you may set a very high
value for this language, or simply set a default value of 0, and set 1
for this language. Only relevant encoding will be taken into account.

This is still limited though, as generic encodings are still implemented
language-agnostic. UTF-8 for instance would be disadvantaged by this
weight system until we make it language-aware.
2022-12-14 00:23:13 +01:00
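The effect of language weights described in 669ede73a3 can be sketched in Python as follows. Multiplying each candidate's confidence by its language weight is an assumption about how the weight "impacts the result"; the function name `rank_candidates`, the tuple layout and the sample confidences are all hypothetical.

```python
def rank_candidates(candidates, weights, default_weight=1.0):
    """Return (encoding, language, weighted confidence) tuples, best first.

    candidates     -- iterable of (encoding, language, confidence)
    weights        -- dict mapping a language code to its weight
    default_weight -- weight for languages absent from `weights`
    Candidates whose weighted confidence is 0 are dropped.
    """
    scored = [
        (encoding, language, confidence * weights.get(language, default_weight))
        for encoding, language, confidence in candidates
    ]
    scored = [c for c in scored if c[2] > 0]
    return sorted(scored, key=lambda c: c[2], reverse=True)
```

With two close candidates such as `("ISO-8859-1", "fr", 0.60)` and `("ISO-8859-15", "it", 0.62)`, a weight of 1.1 on "fr" moves French to the top. Setting the default weight to 0 and "fr" to 1 keeps only French candidates, which matches the "language known, encoding unknown" use case.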
Jehan
f74d602449 src: fix the usage of uchardet tool.
It was displaying -v for both verbose and version options. The new
--verbose short option is actually -V (uppercase).
2022-12-14 00:23:13 +01:00
Jehan
d48ee7abc2 src: uchardet tool now shows the language code in verbose mode. 2022-12-14 00:23:13 +01:00
Jehan
c550af99a7 script: update BuildLangModel.py to updated SequenceModel struct.
In particular, there is now a language code member.
2022-12-14 00:23:13 +01:00
Jehan
5a949265d5 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
project using uchardet could easily get the English language name from
the XML/json files provided by the iso-codes project. That project also
makes it easy to localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2022-12-14 00:23:13 +01:00
Jehan
e7bf25ca08 test: fix test script to use the new API and get rid of build warning. 2022-12-14 00:23:13 +01:00
Jehan
7bc1bc4e0a src: new option --verbose|-V in the uchardet CLI tool.
This new option will give the whole candidate list as well as their
respective confidence (ordered from highest to lowest).
2022-12-14 00:23:13 +01:00
Jehan
8118133e00 src: new API to get all candidates and their confidence.
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()

Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
2022-12-14 00:23:13 +01:00
Jehan
15fc8f0a0f src: now reporting encoding+confidence and keeping a list.
Preparing for an updated API which will also allow looking at the
confidence value, as well as getting the list of possible candidates
(i.e. all detected encodings which had a confidence value high enough so
that we would even consider them).
It is still only internal logic though.
2022-12-14 00:23:13 +01:00
Jehan
2f5c24006e README, doc: some README and release procedure updates. 2022-12-08 22:34:22 +01:00
Jehan
ae6302a016 Release: version 0.0.8. v0.0.8 2022-12-08 21:52:25 +01:00
Jehan
c218a3ccd6 README: add a section about CMake exported targets.
Since it's a new feature, we may as well write about it, even though I
would personally recommend the more standard and generic pkg-config
instead (which is not dependent on which build system we are using
ourselves).
2022-11-30 23:48:16 +01:00
Jehan
6196f86c46 README: update with newly added (lang, charset) couples. 2022-11-30 20:06:52 +01:00
Jehan
388777be51 script, src, test: add IBM865 support for Danish.
Newly added IBM865 charset (for Norwegian) can also be used for Danish.

By the way, I fixed `script/charsets/ibm865.py` as Danish uses the 'da'
ISO 639-1 code, not 'dk' (which is sometimes used for other codes for
Denmark, such as the ISO 3166 country code and internet TLD, but not for
the language itself).

For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
2022-11-30 19:57:52 +01:00
Jehan
5aa628272b script: fix small issues with commits e41e8a4 and 8d15d6b. 2022-11-30 19:24:28 +01:00
Martin T. H. Sandsmark
c11c362b89 Add tests for norwegian 2022-11-30 19:09:21 +01:00
Martin T. H. Sandsmark
099a9a4fd6 Add norwegian support 2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark
e41e8a47e4 improve model building script a bit 2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark
8d15d6b557 make the logfile usable 2022-11-30 19:09:09 +01:00
Jehan
2a04e57c8f test: update the Maltese / ISO-8859-3 test file.
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end, i.e. that if there
are enough actual Maltese words, the stats should still weigh in favor
of Maltese likeness (which they mostly did anyway), but since I am
making some other changes, this was just not enough. In particular I was
changing some of the UTF-8 confidence logic and the file ended up
detected as UTF-8 (even though it has illegal sequences and cannot be!
Cf. #9).

So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have definite
non-questionable Maltese in there to simplify testing at this early
stage of uchardet rewriting.
2022-11-29 14:59:17 +01:00
Lucinda May Phipps
45bd32d102 src/tools/uchardet.cpp: make stuff static 2022-11-29 13:57:31 +00:00
Lucinda May Phipps
ef19faa8c5 Update uchardet-tests.c 2022-11-29 13:57:31 +00:00