288 Commits

Author SHA1 Message Date
Jehan
5463f4e0c0 src: nsEscCharsetProber also returns the correct language.
nsEscCharsetProber will still only return a single candidate, because
this is detected by a state machine, not language statistics anyway.
Anyway now it will also return the language attached to the encoding.
2021-03-17 17:15:56 +01:00
Jehan
ba6b46a68c src: make nsMBCSGroupProber report all valid candidates.
Returning only the best one has limits, as it doesn't allow to check
very close confidence candidates. Now in particular, the UTF-8 prober
will return all ("UTF-8", lang) candidates for every language with
probable statistical fit.
2021-03-17 16:38:20 +01:00
Jehan
49ed0e6f45 src: allow for nsCharSetProber to return several candidates.
No functional change yet because all probers still return 1 candidate.
Yet now we add a GetCandidates() method to return a number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
couple (charset, language) and that the confidence is for this specific
couple, not just the confidence for charset detection.
2021-03-17 13:29:13 +01:00
Jehan
41fc0f235b src: nsMBCSGroupProber confidence weighed by language confidence.
Since our whole charset detection logics is based on text having meaning
(using actual language statistics), just because a text is valid UTF-8
does not mean it is absolutely the right encoding. It may also fit other
encoding with maybe very high statistical confidence (and therefore a
better candidate).
Therefore instead of just returning 0.99 or other high values, let's
weigh our encoding confidence with the best language confidence.
2021-03-17 13:09:10 +01:00
Jehan
714ae9ca29 src: tweak again the language detection confidence.
Computing a logical number of sequence was a big mistake. In particular,
a language with only positive sequence would have the same score as a
language with a mix of only positive and probable sequence (i.e. 1.0).
Instead, just use the real number of sequence, but probable of sequence
don't bring +1 to the numerator.

Also drop the mTypicalPositiveRatio, at least for now. In my tests, it
mostly made results worse. Maybe this would still make sense for
language with a huge number of characters (like CJK languages), for
which we won't have the full list of characters in our "frequent" list
of characters. Yet for most other languages, we actually list all the
possible sequences within the character set, therefore any sequence out
of our sequence list should necessarily drop confidence. Tweaking the
result backup up with some ratio is therefore counter-productive.

As for CJK cases, we'll see how to handle the much higher number of
sequences (too many to list them all) when we get there.
2021-03-17 12:51:25 +01:00
Jehan
26ed628061 test: update unit test to check detected languages.
Excepting ASCII, UTF-16 and UTF-32 for which we don't detect languages
yet.
2021-03-17 12:39:54 +01:00
Jehan
f30c1cd8c8 src: reset language detectors when resetting a nsMBCSGroupProber. 2021-03-17 11:03:30 +01:00
Jehan
5c3a2e8037 src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported models which had single-byte encoding support until
now.
2021-03-17 02:07:17 +01:00
Jehan
2a4d8d890e Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober would not only detect valid UTF-8, but would also
detect the most probable language. Using the data generated 2 commits
away, this works very well.

This is still basic and will require even more improvements. In
particular, now the nsUTF8Prober should return an array of ("UTF-8",
language) couple candidate. And nsMBCSGroupProber should itself forward
these candidates as well as other candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid and low enough to give
chances to other good candidate couples.
2021-03-16 18:37:09 +01:00
Jehan
04c4fd419d New generic language detector class.
It detects languages similarly to the single byte encoding detector
algorithm, based on character frequency and sequence frequency, except
it does it generically from unicode codepoint, not caring at all about
the original encoding.

The confidence algorithm for language is very similar to the confidence
algorithm for encoding+language in nsSBCharSetProber, though I tweaked
it a little making it more trustworthy. And I plan to tweak it even a
bit more later, as I improve progressively the detection logics with
some of the idea I had.
2021-03-16 18:37:09 +01:00
Jehan
9518f4d7a2 Rebuild a bunch of language models.
Adding generic language model (see coming commit), which uses the same
data as specific single-byte encoding statistics model, except that it
applies it to unicode code points.
For this to work, instead of the CharToOrderMap which was mapping
directly from encoded byte (always 256 values) to order, now we add an
array of frequent characters, ordered by generic unicode code points to
the order of frequency (which can be used on the same sequence mapping
array).

This of course means that each prober where we will want to use these
generic models will have to implement their own byte to code point
decoder, as this is per-encoding logics anyway. This will come in a
subsequent commit.
2021-03-16 12:35:18 +01:00
Jehan
82347030ba src: add a --weight option to the CLI tool.
Syntax is: lang1:weight1,lang2:weight2…
For instance: `uchardet -wfr:1.1,it:1.05 file.txt` if you think a file
is probably French or maybe Italian.
2021-03-14 00:12:30 +01:00
Jehan
7f99b91388 src: new weight concept in the C API.
Pretty basic, you can weight prefered language and this will impact the
result. Say the algorithm "hesitates" between encoding E1 in language L1
and encoding E2 in language L2. By setting L2 with a 1.1 weight, for
instance because this is the OS language, or usual prefered language,
you may help the algorithm to overcome very tight cases.

It can also be helpful when you already know for sure the language of a
document, you just don't know its encoding. Then you may set a very high
value for this language, or simply set a default value of 0, and set 1
for this language. Only relevant encoding will be taken into account.

This is still limited though as generic encoding are still implemented
language-agnostic. UTF-8 for instance would be disadvantaged by this
weight system until we make it language-aware.
2021-03-14 00:12:30 +01:00
Jehan
f15d097f29 src: fix the usage of uchardet tool.
It was displaying -v for both verbose and version options. The new
--verbose short option is actually -V (uppercase).
2021-03-14 00:12:30 +01:00
Jehan
4a891ec4ac src: uchardet tool now shows the language code in verbose mode. 2021-03-14 00:12:30 +01:00
Jehan
1db089c7f8 script: update BuildLangModel.py to updated SequenceModel struct.
In particular, there is now a language code member.
2021-03-14 00:12:30 +01:00
Jehan
911695f682 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2021-03-14 00:12:30 +01:00
Jehan
d1ed97b813 test: fix test script to use the new API and get rid of build warning. 2021-03-14 00:12:30 +01:00
Jehan
ae4e3a7cbe src: new option --verbose|-V in the uchardet CLI tool.
This new option will give the whole candidate list as well as their
respective confidence (ordered by higher to lower).
2021-03-14 00:12:30 +01:00
Jehan
4da22cca97 src: new API to get all candidates and their confidence.
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()

Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
2021-03-14 00:12:30 +01:00
Jehan
b43d938804 src: now reporting encoding+confidence and keeping a list.
Preparing for an updated API which will also allow to loop at the
confidence value, as well as get the list of possible candidate (i.e.
all detected encoding which had a confidence value high enough so that
we would even consider them).
It is still only internal logics though.
2021-03-14 00:12:30 +01:00
Aaron Madlon-Kay
6f38ab95f5 Mention MacPorts in readme 2021-01-27 06:57:58 +00:00
Jehan
c8a3572cca Issue #17: update README.
Replace the old link to the science paper by one on archive-mozilla
website. Remove the original source link as I can't find any archived
version of it (even on archive.org, only the folder structure is saved,
not actual files themselves, so it's useless).

Also add some history, which is probably a nice touch.

Add a link to crossroad to help people who'd want to cross-compile
uchardet.

Finally add the R binding by Artem Klevtsov and QtAV as reported.
2020-04-29 16:20:00 +02:00
Jehan
472a906844 Issue #16: "i686" uname not properly detected as x86.
This is basically a continuation of an older bug from Bugzilla:
https://bugs.freedesktop.org/show_bug.cgi?id=101033
2020-04-28 20:43:12 +02:00
myd7349
8681fc060e build: Add uchardet CLI tool building support for MSVC 2020-04-26 08:16:14 +00:00
myd7349
5bcbd23acf build: Fix build errors on Windows
- Fix string no output variables on UWP

  On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result:
  string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE)
  will be treated as:
  string(TOLOWER TARGET_ARCHITECTURE)
  which, as a result, will cause a CMake error:

  CMake Error at CMakeLists.txt:42 (string):
    string no output variable specified

- Remove unnecessary header inclusions in uchardet.cpp

  These extra inclusions cause build errors on Windows.
2020-04-26 10:08:45 +08:00
Jehan
a49f8ef6ea doc: update README.maintainer.
There is one more step to transform a git tag into a proper "Gitlab
release" with the new platform.
2020-04-23 12:32:49 +02:00
Jehan
59f68dbe57 Release: version 0.0.7 v0.0.7 2020-04-23 11:48:58 +02:00
Jehan
98bc2f31ef Issue #8: have BuildLangModel.py add ending newline to generated source. 2020-04-22 22:57:25 +02:00
Jehan
44a50c30ee Issue #8: no newline at end of file.
Not sure if it is in the C++ standard, or was, but apparently some
compilers may complain when files don't end with a newline (though
neither GCC nor Clang as our CI and my local builds are fine).

So here are all our generated source which didn't have such ending
newline (hopefully I forgot none). I just loaded them in my vim editor,
and resaved them. This was enough to add an ending newline.
2020-04-22 22:53:25 +02:00
Jehan
6c7f32a751 Issue #10: Crashing sequence with nsSJISProber.
uchardet_handle_data() should not try to process data of nul length.

Still this is not technically an error to feed empty data to the engine,
and I could imagine it could happen especially when done in some
automatic process with random input files (which looks like what was
happening in the reporter case). So feeding empty data just returns a
success without actually doing any processing, allowing to continue the
data feed.
2020-04-22 22:11:51 +02:00
Jehan
ef0313046b Also allow uchardet tool to detect encoding of a file named "--".
My previous commit was good except for the very special case of wanting
to analyze a file named "--". This file would be ignored.

With this change, only the first "--" option will be ignored as meaning
"end of option arguments", but any remaining value (another "--"
included) will be considered as a file path.
2020-04-22 21:11:23 +02:00
Jehan
4a37dfdf1c Issue #15: support "--" end-of-option. 2020-04-22 21:05:44 +02:00
wangqr
ae7acbd0f2 Add dllexport to interface functions
This allows building the DLL on Windows with other compilers than GNU ones.
See MR !4.
2020-04-22 18:54:07 +00:00
Artem Klevtsov
2694ba6363 Fix global-buffer-overflow due EUCTW_TABLE_SIZE 2020-04-22 17:06:40 +00:00
Jehan
81ab1d1da1 gitlab-ci: Adding a Clang build. 2020-04-22 18:04:56 +02:00
Jehan
6afec53adc gitlab-ci: Windows 32 and 64-bit builds. 2020-04-22 18:00:36 +02:00
Jehan
b5674dbd50 gitlab-ci: first CI build for uchardet.
Very simple CI since uchardet is an extremely low/no dependency library.
So basically we install CMake in Debian/testing and we are good.
2020-04-22 17:22:23 +02:00
Jehan
e0b9269849 Fix various other occurrences of bug tracker URL in code/build. 2020-04-22 12:29:41 +02:00
Jehan
60bf53c81e README: update to Gitlab links.
Freedesktop moved its infrastructure to Gitlab a while ago.
2020-04-22 00:33:48 +02:00
Jehan
0cfb75724a README: some small updates. 2020-04-22 00:17:23 +02:00
Jehan
bdfd6116a9 Add a mention about fd.o code of conduct. 2018-09-26 15:12:25 +02:00
Ilya Tumaykin
f136d434f0 build: turn TARGET_ARCHITECTURE into option
Default value is autodetected if not specified by user.
2018-01-21 15:58:13 +01:00
Jehan
95872ef41c Adding some information about building for Windows. 2017-12-26 03:37:42 +01:00
Jehan
df67ae4fe0 CMake: get rid of some commented code.
It says that's for Win32 platform and uses the install prefix as library
prefix. But that's not at all the same kind of prefixes!
CMAKE_INSTALL_PREFIX expected value is the path to install the lib (what
is called the "installation prefix"), whereas CMAKE_*_LIBRARY_PREFIX are
the prefix on the file name (usually "lib" on UNIX-like systems).
Anyway I don't see a need to change this value. It will be called
"libuchardet.dll" on Win32. I don't see the problem.
Also this code was already commented out, and compilation and usage for
Win32 works just fine without it. :-)
2017-12-24 19:47:05 +01:00
Jehan
cd617d181d CMake: do not check/set SSE and float-store options on non-x86 targets.
Not sure if that's right. I guess we might also find non-x86 machines
where floating point computation won't follow IEEE standard as well. But
let's do this for now to prevent from useless performance hit.
2017-11-07 00:37:54 +01:00
Jehan
939482ab2b CMake: slightly improve the configuration option messages.
Also add full stops, similarly to CMake defaut options.
2017-11-06 02:11:20 +01:00
Jehan
77bf71ea36 CMake: rename s/ENABLE_SSE2/CHECK_SSE2/.
"ENABLE_SSE2" may be misleading since having it ON does not necessarily
mean that SSE2 flags will be actually set. It only means that the
support will be checked (then set only when supported).
Also adding the warning about possible performance decrease.
2017-11-06 02:07:40 +01:00
Jehan
5996bbd995 Bug 101033 - Testsuite fails on i386.
Floating point accuracy may be different depending on the architecture.
In particular some architectures may store floating values with
different precision, resulting in unreliable results across various
machines. It would seem in particular true on older x86 machines without
SSE support, which were reported cases.
The proposed solution is to test for SSE support and explicitly add the
proper flags (even though they are set by default anyway on modern x86).
When this is not available (on older machines or simply when not on x86
processors), I replace sse2 flags with -ffloat-store, which forces IEEE
floating point definition.
The reason why not to always force -ffloat-store is because it seems to
decrease performance on some machines. SSE is prefered if available.

I also add a ENABLE_SSE2 option on the CMake file to allow builders to
use -ffloat-store even though SSE2 may be available on the build
machine. This would allow to build portable binaries which can also be
installed on older machines.
2017-11-06 01:56:45 +01:00
Jehan
056a5a6e51 README: add some applications having uchardet as dependency.
There are likely more (and I know some are planning support) but these
are the ones I know of and with support already in.
2017-09-21 00:06:03 +02:00