189 Commits

Author SHA1 Message Date
Jehan
25d2890676 src: tweak again the language detection confidence.
Computing a logical number of sequences was a big mistake. In particular,
a language with only positive sequences would have the same score as a
language with a mix of only positive and probable sequences (i.e. 1.0).
Instead, just use the real number of sequences, but probable sequences
don't bring +1 to the numerator.

Also drop the mTypicalPositiveRatio, at least for now. In my tests, it
mostly made results worse. Maybe this would still make sense for
language with a huge number of characters (like CJK languages), for
which we won't have the full list of characters in our "frequent" list
of characters. Yet for most other languages, we actually list all the
possible sequences within the character set, therefore any sequence out
of our sequence list should necessarily drop confidence. Tweaking the
result back up with some ratio is therefore counter-productive.

As for CJK cases, we'll see how to handle the much higher number of
sequences (too many to list them all) when we get there.
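
A minimal sketch of the scoring described above, not the actual uchardet code. The `SequenceCounts` structure, the function name and the 0.25 weight are assumptions made for illustration: the message only says that a probable sequence brings less than +1 to the numerator.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counters; the names are illustrative, not uchardet's.
struct SequenceCounts {
    std::size_t positive;  // sequences seen most often in the training data
    std::size_t probable;  // sequences seen in the training data, but less often
    std::size_t total;     // real number of sequences processed
};

// Assumed weight: a probable sequence counts for less than a positive one.
constexpr float kProbableWeight = 0.25f;

// Confidence in [0, 1]; the denominator is the real number of sequences.
float LanguageConfidence(const SequenceCounts& c) {
    assert(c.positive + c.probable <= c.total);
    if (c.total == 0) return 0.0f;
    return (c.positive + kProbableWeight * c.probable) / c.total;
}
```

With these assumptions, a text made only of positive sequences scores 1.0, while a mix of positive and probable sequences scores strictly less.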
2022-12-14 00:23:13 +01:00
Jehan
82c1d2b25e src: reset language detectors when resetting a nsMBCSGroupProber. 2022-12-14 00:23:13 +01:00
Jehan
eb8308d50a src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported models which had single-byte encoding support until
now.
2022-12-14 00:23:13 +01:00
Jehan
5257fc1abf Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober would not only detect valid UTF-8, but would also
detect the most probable language. Using the data generated 2 commits
away, this works very well.

This is still basic and will require even more improvements. In
particular, now the nsUTF8Prober should return an array of ("UTF-8",
language) candidate couples. And nsMBCSGroupProber should itself forward
these candidates as well as candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection. In particular, if no language is detected, this should
severely weigh down UTF-8 detection: not to 0, but high enough to be a
fallback in case no other encoding+lang is valid, and low enough to give
chances to other good candidate couples.
2022-12-14 00:23:13 +01:00
Jehan
dac7cbd30f New generic language detector class.
It detects languages similarly to the single byte encoding detector
algorithm, based on character frequency and sequence frequency, except
it does it generically from unicode codepoint, not caring at all about
the original encoding.

The confidence algorithm for language is very similar to the confidence
algorithm for encoding+language in nsSBCharSetProber, though I tweaked
it a little, making it more trustworthy. And I plan to tweak it even a
bit more later, as I progressively improve the detection logic with
some of the ideas I had.
2022-12-14 00:23:13 +01:00
Jehan
b70b1ebf88 Rebuild a bunch of language models.
Adding generic language model (see coming commit), which uses the same
data as specific single-byte encoding statistics model, except that it
applies it to unicode code points.
For this to work, instead of the CharToOrderMap which was mapping
directly from encoded byte (always 256 values) to order, now we add an
array of frequent characters, ordered by generic unicode code points to
the order of frequency (which can be used on the same sequence mapping
array).
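
A minimal sketch of such a lookup, assuming a table sorted by code point and searched with a binary search. The `CodePointOrder` type, the `OrderOf` function and the three entries are hypothetical and do not come from a real model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: a frequent character and its frequency rank.
struct CodePointOrder {
    std::uint32_t codePoint;
    int order;  // 0 = most frequent character of the language
};

// Illustrative data only (not a real model), sorted by code point.
static const CodePointOrder kFrequentChars[] = {
    {0x0061, 1},  // 'a'
    {0x0065, 0},  // 'e'
    {0x00E9, 2},  // 'é'
};

// Binary search; returns -1 for characters outside the frequent list.
int OrderOf(std::uint32_t codePoint) {
    const CodePointOrder* begin = kFrequentChars;
    const CodePointOrder* end =
        begin + sizeof(kFrequentChars) / sizeof(kFrequentChars[0]);
    assert(std::is_sorted(begin, end,
        [](const CodePointOrder& a, const CodePointOrder& b) {
            return a.codePoint < b.codePoint;
        }));
    const CodePointOrder* it = std::lower_bound(begin, end, codePoint,
        [](const CodePointOrder& e, std::uint32_t cp) {
            return e.codePoint < cp;
        });
    return (it != end && it->codePoint == codePoint) ? it->order : -1;
}
```

The returned order can then index the same sequence mapping array as the order obtained from a byte-based map.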

This of course means that each prober where we will want to use these
generic models will have to implement their own byte to code point
decoder, as this is per-encoding logic anyway. This will come in a
subsequent commit.
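
As an illustration of what a per-encoding byte to code point decoder can look like, here is a simplified UTF-8 version. `DecodeUtf8` is a hypothetical helper, not the decoder used by any prober, and it deliberately skips the overlong and surrogate checks a strict decoder needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence starting at data[0].
// Returns the number of bytes consumed, or 0 if the input is truncated
// or malformed. Simplified: overlong forms and surrogates are not rejected.
std::size_t DecodeUtf8(const unsigned char* data, std::size_t len,
                       std::uint32_t* codePoint) {
    assert(data != nullptr && codePoint != nullptr);
    if (len == 0) return 0;
    std::size_t need;
    std::uint32_t cp;
    if (data[0] < 0x80)                { need = 1; cp = data[0]; }
    else if ((data[0] & 0xE0) == 0xC0) { need = 2; cp = data[0] & 0x1F; }
    else if ((data[0] & 0xF0) == 0xE0) { need = 3; cp = data[0] & 0x0F; }
    else if ((data[0] & 0xF8) == 0xF0) { need = 4; cp = data[0] & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (len < need) return 0;  // sequence cut by the end of the buffer
    for (std::size_t i = 1; i < need; ++i) {
        if ((data[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (data[i] & 0x3F);
    }
    *codePoint = cp;
    return need;
}
```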
2022-12-14 00:23:13 +01:00
Jehan
a0bfba3db3 src: add a --weight option to the CLI tool.
Syntax is: lang1:weight1,lang2:weight2…
For instance: `uchardet -wfr:1.1,it:1.05 file.txt` if you think a file
is probably French or maybe Italian.
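
A minimal sketch of parsing this syntax. `ParseWeights` is a hypothetical helper written for illustration; it is not the parser used by the CLI tool.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Parses "lang1:weight1,lang2:weight2" into a map.
// Returns false on malformed or empty input.
bool ParseWeights(const std::string& spec,
                  std::map<std::string, float>* weights) {
    assert(weights != nullptr);
    std::istringstream stream(spec);
    std::string item;
    while (std::getline(stream, item, ',')) {
        const std::string::size_type colon = item.find(':');
        if (colon == std::string::npos || colon == 0 ||
            colon + 1 == item.size())
            return false;  // missing language or missing weight
        const std::string value = item.substr(colon + 1);
        char* end = nullptr;
        const float weight = std::strtof(value.c_str(), &end);
        if (end == value.c_str() || *end != '\0')
            return false;  // the weight is not a complete number
        (*weights)[item.substr(0, colon)] = weight;
    }
    return !weights->empty();
}
```

Fed with the `fr:1.1,it:1.05` value from the example above, it yields two entries, one per language.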
2022-12-14 00:23:13 +01:00
Jehan
669ede73a3 src: new weight concept in the C API.
Pretty basic: you can weight preferred languages and this will impact the
result. Say the algorithm "hesitates" between encoding E1 in language L1
and encoding E2 in language L2. By setting L2 with a 1.1 weight, for
instance because this is the OS language or the usual preferred language,
you may help the algorithm to overcome very tight cases.

It can also be helpful when you already know for sure the language of a
document, you just don't know its encoding. Then you may set a very high
value for this language, or simply set a default value of 0, and set 1
for this language. Only relevant encodings will be taken into account.

This is still limited though, as generic encodings are still implemented
in a language-agnostic way. UTF-8 for instance would be disadvantaged by this
weight system until we make it language-aware.
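
A minimal sketch of the weighting idea described in this message, assuming the weight simply multiplies the confidence. `Candidate` and `BestWeighted` are hypothetical names and are not part of the C API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical candidate couple, for illustration only.
struct Candidate {
    std::string encoding;
    std::string language;
    float confidence;
};

// Scales each confidence by its language weight, or by defaultWeight when
// the language has no explicit weight. Returns the index of the best
// candidate, or -1 if every candidate ends up with a zero score.
int BestWeighted(const std::vector<Candidate>& candidates,
                 const std::map<std::string, float>& weights,
                 float defaultWeight) {
    assert(defaultWeight >= 0.0f);
    int best = -1;
    float bestScore = 0.0f;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const auto it = weights.find(candidates[i].language);
        const float w = (it != weights.end()) ? it->second : defaultWeight;
        const float score = candidates[i].confidence * w;
        if (score > bestScore) {
            bestScore = score;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A 1.1 weight on L2 is enough to flip a tight 0.80 versus 0.78 case, and a default weight of 0 with a weight of 1 on a single language keeps only the candidates for that language.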
2022-12-14 00:23:13 +01:00
Jehan
f74d602449 src: fix the usage of uchardet tool.
It was displaying -v for both verbose and version options. The new
--verbose short option is actually -V (uppercase).
2022-12-14 00:23:13 +01:00
Jehan
d48ee7abc2 src: uchardet tool now shows the language code in verbose mode. 2022-12-14 00:23:13 +01:00
Jehan
5a949265d5 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
project using uchardet could easily get the English language name from the
XML/JSON files provided by the iso-codes project. iso-codes also makes it
easy to localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2022-12-14 00:23:13 +01:00
Jehan
7bc1bc4e0a src: new option --verbose|-V in the uchardet CLI tool.
This new option will give the whole candidate list as well as their
respective confidence (ordered from highest to lowest).
2022-12-14 00:23:13 +01:00
Jehan
8118133e00 src: new API to get all candidates and their confidence.
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()

Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
2022-12-14 00:23:13 +01:00
Jehan
15fc8f0a0f src: now reporting encoding+confidence and keeping a list.
Preparing for an updated API which will also allow to look at the
confidence value, as well as get the list of possible candidates (i.e.
all detected encodings which had a confidence value high enough so that
we would even consider them).
It is still only internal logic though.
2022-12-14 00:23:13 +01:00
Jehan
388777be51 script, src, test: add IBM865 support for Danish.
The newly added IBM865 charset (for Norwegian) can also be used for Danish.

By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the 'da'
ISO 639-1 code, not 'dk' (which is used for other
codes for Denmark, such as the ISO 3166 country code and internet TLD, but
not for the language itself).

For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
2022-11-30 19:57:52 +01:00
Martin T. H. Sandsmark
099a9a4fd6 Add Norwegian support 2022-11-30 19:09:09 +01:00
Lucinda May Phipps
45bd32d102 src/tools/uchardet.cpp: make stuff static 2022-11-29 13:57:31 +00:00
Lucinda May Phipps
383bf118c9 don't use feof 2022-11-29 13:57:31 +00:00
Pedro López-Cabanillas
d7dad549bd cmake exported targets
The minimum required cmake version is raised to 3.1,
because the exported targets started at that version.

The build system creates the exported targets:
- The executable uchardet::uchardet
- The library uchardet::libuchardet
- The static library uchardet::libuchardet_static

A downstream project using CMake can find and link the library target
directly with cmake (without needing pkg-config) this way:

~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
  add_executable( sample sample.c )
  target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~

After installing uchardet in a prefix like "$HOME/uchardet/":
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."

Instead of installing, the build directory can be used directly, for
instance:

cmake -Duchardet_DIR="$HOME/uchardet-0.1.0/build/" ...
2021-11-09 09:52:15 +00:00
myd7349
8681fc060e build: Add uchardet CLI tool building support for MSVC 2020-04-26 08:16:14 +00:00
myd7349
5bcbd23acf build: Fix build errors on Windows
- Fix "string no output variable specified" error on UWP

  On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result:
  string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE)
  will be treated as:
  string(TOLOWER TARGET_ARCHITECTURE)
  which, as a result, will cause a CMake error:

  CMake Error at CMakeLists.txt:42 (string):
    string no output variable specified

- Remove unnecessary header inclusions in uchardet.cpp

  These extra inclusions cause build errors on Windows.
2020-04-26 10:08:45 +08:00
Jehan
44a50c30ee Issue #8: no newline at end of file.
Not sure if it is in the C++ standard, or was, but apparently some
compilers may complain when files don't end with a newline (though
neither GCC nor Clang does, as our CI and my local builds are fine).

So here are all our generated sources which didn't have such an ending
newline (hopefully I forgot none). I just loaded them in my vim editor,
and resaved them. This was enough to add an ending newline.
2020-04-22 22:53:25 +02:00
Jehan
6c7f32a751 Issue #10: Crashing sequence with nsSJISProber.
uchardet_handle_data() should not try to process data of zero length.

Still this is not technically an error to feed empty data to the engine,
and I could imagine it could happen especially when done in some
automatic process with random input files (which looks like what was
happening in the reporter case). So feeding empty data just returns a
success without actually doing any processing, allowing to continue the
data feed.
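
A minimal sketch of this early return. `Detector` and `HandleData` are hypothetical stand-ins, not the real implementation, and the extra null-pointer check is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the detection engine.
struct Detector {
    std::size_t bytesSeen = 0;
    void Process(const char* data, std::size_t len) {
        (void)data;  // a real engine would feed its probers here
        bytesSeen += len;
    }
};

// Returns 0 on success and non-zero on error.
int HandleData(Detector* d, const char* data, std::size_t len) {
    assert(d != nullptr);
    if (len == 0) return 0;         // empty input: success, nothing processed
    if (data == nullptr) return 1;  // a length without a buffer is an error
    d->Process(data, len);
    return 0;
}
```

The caller can keep feeding further chunks after an empty one, since the detector state is left untouched.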
2020-04-22 22:11:51 +02:00
Jehan
ef0313046b Also allow uchardet tool to detect encoding of a file named "--".
My previous commit was good except for the very special case of wanting
to analyze a file named "--". This file would be ignored.

With this change, only the first "--" option will be ignored as meaning
"end of option arguments", but any remaining value (another "--"
included) will be considered as a file path.
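
A minimal sketch of this rule. `CollectFiles` is a hypothetical helper that ignores real option handling and only shows how the first "--" is consumed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collects file paths from command-line arguments.
// Only the first "--" is consumed as the end-of-options marker.
std::vector<std::string> CollectFiles(const std::vector<std::string>& args) {
    std::vector<std::string> files;
    bool optionsEnded = false;
    for (const std::string& arg : args) {
        if (!optionsEnded && arg == "--") {
            optionsEnded = true;  // marker itself is not a file
            continue;
        }
        if (!optionsEnded && arg.size() > 1 && arg[0] == '-')
            continue;  // an option: skipped in this sketch
        files.push_back(arg);  // everything else is a file path
    }
    return files;
}
```

With the arguments `-V -- -- -x`, the files are `--` and `-x`.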
2020-04-22 21:11:23 +02:00
Jehan
4a37dfdf1c Issue #15: support "--" end-of-option. 2020-04-22 21:05:44 +02:00
wangqr
ae7acbd0f2 Add dllexport to interface functions
This allows building the DLL on Windows with other compilers than GNU ones.
See MR !4.
2020-04-22 18:54:07 +00:00
Artem Klevtsov
2694ba6363 Fix global-buffer-overflow due to EUCTW_TABLE_SIZE 2020-04-22 17:06:40 +00:00
Jehan
e0b9269849 Fix various other occurrences of bug tracker URL in code/build. 2020-04-22 12:29:41 +02:00
Jehan
1898847eb6 src: cast value to its proper type.
Thanks to Marino Faggiana for reporting it.
See: https://github.com/BYVoid/uchardet/issues/37
2017-08-27 13:01:30 +02:00
Jehan
170ef349cf src: fix some doc comments. s/a instance/an instance/.
Unless mistaken, we should use "an" with next word starting with
vowel.
2017-08-19 10:46:25 +02:00
Jehan
c049332c41 src: s/detctor/detector/. 2017-08-18 12:03:54 +02:00
Jehan
53f7ad0e0b Bug 101032 - assignments to nsSMState in nsCodingStateMachine result...
... in unspecified behavior.
When compiling with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states depend on every different charset's state
machine, it is not possible to simply extend the enum with more generic
values. Instead let's just make the state an unsigned int value and
define the 3 generic states as constants.
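
A minimal sketch of the change, assuming the three generic states keep the values 0, 1 and 2. The transition table and `NextState` are hypothetical and only show that charset-specific values above the generic ones are now ordinary integers.

```cpp
#include <cassert>
#include <cstddef>

// Before (sketch): enum nsSMState { eStart = 0, eError = 1, eItsMe = 2 };
// Storing 5 in such an enum is what UBSan reports.

// After (sketch): a plain unsigned integer plus three generic constants.
typedef unsigned int nsSMState;
const nsSMState eStart = 0;
const nsSMState eError = 1;
const nsSMState eItsMe = 2;

// Hypothetical transition table; values above eItsMe are states that
// only make sense for one specific charset state machine.
static const nsSMState kNextState[] = {eStart, eError, eItsMe, 5, 3, eStart};

nsSMState NextState(nsSMState current) {
    assert(current < sizeof(kNextState) / sizeof(kNextState[0]));
    return kNextState[current];
}
```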
2017-05-28 20:01:06 +02:00
Jehan
50bc02c0ff Request C++11 standard project-wide and make it a strong requirement.
It is unneeded to do it by target, using the global property
CMAKE_CXX_STANDARD instead. Also with CMAKE_CXX_STANDARD_REQUIRED, I
make this a strong requirement. The documentation indeed states that the
CXX_STANDARD "is treated as optional and may “decay” to a previous
standard if the requested is not available".
This means that uchardet will likely not be buildable with a compiler
with no C++11 support. But I assume this is not a common situation, and
probably we should not care about outdated compilers. I remain open to
suggestions and disagreement on the topic obviously.
2017-05-28 15:43:44 +02:00
Jehan
1bf198cb0f Make C++11 the standard used for uchardet.
As discussed in bug 101032, it seems like the most common usage
nowadays. Let's make a specific choice to avoid different behavior on
different builds later on.
2017-05-28 15:32:06 +02:00
Jehan
98bf4d73fd Bug 101204 - different results with different chunk sizes.
ASCII and ISO-8859-1 should not be detected in
nsUniversalDetector::HandleData() but in nsUniversalDetector::DataEnd()
instead. Otherwise it creates an unwanted shortcut from the first call
to uchardet_handle_data() if the input is broken into several pieces and
if the first chunk happens to be ASCII (or ASCII + NBSP).
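
A heavily simplified sketch of why deferring the decision makes the result independent of chunk sizes. `AsciiTracker` is a hypothetical class: it reuses the HandleData/DataEnd method names but none of the real prober logic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, heavily simplified detector.
class AsciiTracker {
public:
    // Only records what was seen; never concludes "ASCII" early.
    void HandleData(const char* data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i)
            if (static_cast<unsigned char>(data[i]) >= 0x80)
                mSawHighByte = true;
    }
    // The verdict is taken once, after all chunks were fed.
    // An empty string means "not ASCII, other probers must decide".
    std::string DataEnd() const { return mSawHighByte ? "" : "ASCII"; }

private:
    bool mSawHighByte = false;
};
```

Feeding `abc` then a chunk containing a high byte gives the same answer as feeding everything at once, because no verdict is produced after the first chunk.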
2017-05-28 14:14:48 +02:00
Jehan
50743e16f8 src: minor indentation fix. 2017-05-14 21:35:11 +02:00
Jehan
94b10b9b29 Bug 101030 - Buffer overflow related to ISO2022JP detection in...
... en:ascii and ja:iso-2022-jp tests.
I don't know much about this part of the code at this point. Yet I can
clearly deduce that the length of the charLenTable is supposed to be the
classFactor of the SMModel. Therefore 2 classes were missing in
ISO2022JPCharLenTable, hence a buffer overflow happens when trying to
reach these. I am not sure of the values I should add there. For now,
let's set 0 to both, but adding also a comment so that I can review this
code later on, when I will get to read and understand this piece of code
in more depth.
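
A minimal sketch of how such a mismatch could be caught at compile time. The constant, the table contents and `CharLenForClass` are hypothetical and do not reproduce the real model.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical numbers; the real tables live in the state machine models.
constexpr std::size_t kClassFactor = 10;  // number of character classes
static const unsigned int kCharLenTable[] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

// One length per class: a shorter table would be read out of bounds.
static_assert(sizeof(kCharLenTable) / sizeof(kCharLenTable[0]) == kClassFactor,
              "charLenTable must have classFactor entries");

unsigned int CharLenForClass(std::size_t byteClass) {
    assert(byteClass < kClassFactor);
    return kCharLenTable[byteClass];
}
```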
2017-05-14 19:49:01 +02:00
Jehan
64efb1b24c Bug 101031 - Memory leak of nsSBCSGroupProber.
This manual incrementation code is just horrible and so error-prone.
Some day, we should make a cleaner loop to register all these
single-byte charset probers.
2017-05-14 18:24:11 +02:00
Jehan
119fed7e8d LangModels: add Swedish support.
Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252.
Test text from https://sv.wikipedia.org/wiki/Mölle
2016-09-28 22:42:13 +02:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00
Jehan
fbd2efdbe9 LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
2016-09-28 19:57:50 +02:00
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close so I added quotation marks
(Jutumärgid) which are on codepoints only present in ISO-8859-13,
making it possible to tell both encodings apart.
2016-09-27 00:14:29 +02:00
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
05ba8555cd src: fix number of Single-Byte charset probers. 2016-09-25 14:02:39 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added
since other encoding end up similar as ISO-8859-1 for most common texts
(i.e. glyphs used in Italian are on the same codepoints on these other
encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
Jehan
6bbe7da1ac LangModels: add Finnish support.
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encoding end up the same (same
codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1
and UTF-8. Models for other encoding may still be useful when processing
texts with some symbols, etc.
2016-09-21 18:27:39 +02:00
Jehan
a59b1c9571 src: update documentation comments on the public API. 2016-09-21 17:36:17 +02:00
Jehan
3401ac70d0 LangModels: add Polish support.
With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16,
Windows-1250, IBM852, MAC-CENTRALEUROPE.
Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska
2016-09-21 17:30:15 +02:00
Jehan
5f9ec3aef0 LangModels: add support for Slovak.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope) since the resources I found indicate they used the
same encodings historically.
Also it is to be noted that the test examples' encodings were already
properly detected through Czech's models so the languages are definitely
very close, even statistically. Nevertheless adding the right models
will work better and these get better scores. This will become fully
meaningful when uchardet is also used as a language detector (in some
not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
2016-09-21 13:42:20 +02:00
Jehan
26e1cebad1 LangModels: add support for Czech.
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, whose contents were released into the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
2016-09-21 03:33:50 +02:00
Jehan
183092d048 src: fix non-guarded 'if' warning.
Not sure if this is useful to have the 'if (mDetectedCharset)' outside
the if block, but it won't hurt for sure in this specific case, so I
leave the current code logics as is.
The exact warning was:
nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’:
nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
     if (aLen > 2)
     ^~
nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
       if (mDetectedCharset)
       ^~
2016-09-21 02:37:31 +02:00
Jehan
2700cf3a83 LangModels: support for Maltese / ISO-8859-3.
Test text from https://mt.wikipedia.org/wiki/Franza.
2016-09-21 02:11:31 +02:00
Jehan
b7aebfdfda LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10.
Just realizing that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway now the models are updated and can recognize texts
using these encodings for these languages.
Added some test files as well, which work great.
2016-09-21 00:27:16 +02:00
Jehan
e138839f07 LangModels: add support for Portuguese / ISO-8859-1.
I actually added also couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless there are no differences on the main
characters related to Portuguese, so these can hardly be told apart
and detection will usually return ISO-8859-1 only.
2016-09-21 00:01:07 +02:00
Jehan
ea2f4dd40f LangModels: new support for Latvian / ISO-8859-13.
Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs
2016-09-20 23:29:53 +02:00
Jehan
7cb3dd9ddd LangModels: add support for Lithuanian / ISO-8859-13.
Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.
2016-09-20 23:09:24 +02:00
Jehan
157de1dc65 src: the EUC-KR prober now returns "UHC" as encoding name.
"UHC" is the "Unified Hangul Code" (aka Windows-949 or CP949). It is
apparently "mostly" upward compatible with EUC-KR so returning UHC for
a strict EUC-KR document is usually not to be considered wrong.
Yet I can read that EUC-KR has its own way of representing hangul
syllables not available in precomposed form, and this is not supported
in UHC (since this latter has all possible precomposed syllables), hence
the "mostly" upward-compatibility.
My personal daily experience with Korean documents though is that I
encounter a lot of UHC-encoded files, probably because of predominance
of Microsoft operating systems, which spread this encoding.
So until we get 2 separate detection machines, let's just return EUC-KR
files as being "UHC".
2016-09-19 01:22:45 +02:00
Jehan
771d78b7df Update the URL links: uchardet is now a freedesktop project. 2016-07-20 01:47:50 +02:00
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a Greek subtitle
badly detected (see commit e0eec3b).
It didn't help, but since I updated it with much more data from
Wikipedia, let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
e0eec3bae8 src: give a little weight to "probable sequences".
Up to now, we were only considering positive sequences, which are
sequences of 2 characters which happen the most. Yet our data gathers
4 categories of sequences (the last one being called "negative", since
they never happened in our data).
I will call the category just below positive "probable sequences". They may
happen, yet not often. The remaining category could be called "neutral".
This seems to fix the detection of a user's subtitle example without
breaking any of our current unit tests.
Probably I should still review this whole logic in more detail later.
2016-05-25 17:38:20 +02:00
Jehan
4287d3accc src: trailing whitespace removed. 2016-05-25 16:07:17 +02:00
Ilya Tumaykin
2a3e41a6c3 cmake: drop useless PACKAGE_NAME redefinition 2016-03-22 01:23:06 +03:00
Ilya Tumaykin
6db8b6f8fe cmake: minor comment cleanups 2016-03-22 01:23:06 +03:00
Ilya Tumaykin
d0e7ddd8ab cmake: fix library filename and SONAME
Make library filename respect the current uchardet version and
make library SONAME respect the current major version.
2016-03-22 01:23:05 +03:00
Ilya Tumaykin
ad647d2e0a cmake: keep compiler definitions in one place 2016-03-22 01:23:05 +03:00
Ilya Tumaykin
29f18210b1 cmake: hardcode less 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
7201835c98 cmake: export UCHARDET_LIBRARY to the topmost scope 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
e7feb35627 cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
1a1f4bfbd8 cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:03 +03:00
Ilya Tumaykin
31a53570d6 cmake: use GNUInstallDirs cmake module
Available in cmake >= 2.8.5.
2016-03-22 01:23:03 +03:00
Ilya Tumaykin
b44be77be6 cmake: uniform indent everywhere
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)
78b55ec9fe CMake: Fix regression in f53cb8c building in paths with spaces
Tested with Ninja and Make in Windows and Archlinux with paths
with and without spaces.
2016-03-18 03:37:12 +00:00
Jehan
fcc525a64f Merge pull request #25 from Coacher/master
cmake: purge remnants of opencc after b6d872bb
2016-03-17 19:10:39 +01:00
Jehan
d255184609 Merge pull request #24 from wiiaboo/ab-suite
Improving build with more options.

Static-only builds are now possible, building the uchardet command-line tool can be disabled, bindir can be customized…
2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)
86755b1f57 CMake: Don't build static more than once 2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)
b908b689a0 CMake: Add static lib destination to UCHARDET_TARGET 2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)
81ed86a26b CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN
This way it always shows up in ccmake, even if not defined.

A string is used instead of path because I personally think it makes more
sense in the following use-cases:

STRING:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
installs everything to /home/user/{lib,etc,share,(...)} and executables to
${CMAKE_INSTALL_PREFIX}/bins

-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
everything to /home/user/{lib,etc,share,(...)} and executables to
/opt/bin

PATH:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
everything to /home/user/{lib,etc,share,(...)} and executables to
$(pwd)/bins (!)
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
same as STRING
2016-03-16 19:11:33 +00:00
Ilya Tumaykin
aa4c2aeada cmake: purge remnants of opencc after b6d872bb 2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)
50b2e0802f CMake: Allow not building executable 2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)
6500f09931 CMake: Allow building static-only builds
Add stdc++ to static libs in pkg-config
2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)
f53cb8cddd CMake: fix linking with Ninja 2016-03-16 14:17:47 +00:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
Test for ISO-8859-1 is disabled for now since the difference is not big
enough, as for characters used in Danish, between ISO-8859-1 and
ISO-8859-15. Therefore the first to be declared "wins".
Let's see how to improve this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
Jehan
248d6dbd35 tools: exit with non-zero value on uchardet error. 2016-01-21 18:16:42 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
a251753db8 LangModels: updating Hungarian language models. 2015-12-12 18:06:17 +01:00
Jehan
4c8316f9cf Nearly-ASCII text with NBSP is still not ASCII.
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according to
code comments), probably tied to specificities of the original program from
Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only
exception" (note that I could have returned any ISO-8859 charsets since
they all have this character in common).
2015-12-05 21:11:29 +01:00
Jehan
e5234d6b61 Stating endianness of UTF-16 and UTF-32 was an error when BOM present.
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
Also it broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE would create a text without BOM, and a
conversion from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM,
which changed existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
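
A minimal sketch of the labelling rule. `LabelFromBom` is a hypothetical helper, not the detector's code; it only shows that a BOM leads to the endianness-free label.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the label for a leading BOM, or "" when there is none.
// The endianness is implied by the BOM, so the label does not repeat it.
std::string LabelFromBom(const unsigned char* d, std::size_t len) {
    assert(d != nullptr || len == 0);
    // UTF-32 first: its little-endian BOM starts with the UTF-16LE one.
    if (len >= 4 &&
        ((d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00) ||
         (d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF)))
        return "UTF-32";
    if (len >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 &&
        ((d[0] == 0xFF && d[1] == 0xFE) || (d[0] == 0xFE && d[1] == 0xFF)))
        return "UTF-16";
    return "";
}
```

Note that this sketch treats FF FE 00 00 as UTF-32, although the same bytes could also be a UTF-16LE text starting with U+0000 after its BOM.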
2015-12-04 19:19:39 +01:00
Jehan
5691dc59a1 LangModels: rename Cyrillic models to Russian models.
Our language models are per-lang, not per script.
2015-12-04 03:27:29 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is essentially identical to TIS-620, with the added
non-breaking space character.
So our detection will always return TIS-620, except in the
rare cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
55b4f23971 Single Byte charsets: high ctrl character ratio lowers confidence.
Control characters are not an error per se. Nevertheless they are clearly not
frequent in single-byte charset texts. It is only normal for them to lower
confidence in a charset. In particular a higher ctrl-per-letter ratio means
a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
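
A minimal sketch of such a penalty. The formula is an assumption: the message only states that a higher ctrl-per-letter ratio means a lower confidence, and `PenalizeControlChars` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Scales a confidence down by the share of control characters.
// Assumed formula: confidence * letters / (letters + control characters).
float PenalizeControlChars(float confidence, std::size_t ctrlChars,
                           std::size_t letters) {
    assert(confidence >= 0.0f && confidence <= 1.0f);
    const std::size_t total = ctrlChars + letters;
    if (total == 0) return confidence;  // nothing counted yet
    const float ratio = static_cast<float>(letters) / static_cast<float>(total);
    return confidence * ratio;
}
```

Under this assumption, a text without control characters keeps its confidence unchanged, and two charsets that otherwise score the same are separated by the one producing fewer control characters.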
2015-12-04 00:04:43 +01:00
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
ea34e8b1bd Update doc comment.
We do not return an empty string on ASCII anymore. It now only means
detection failure. ASCII will get a proper "ASCII" return.
2015-12-03 20:36:09 +01:00