98 Commits

Author SHA1 Message Date
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(Jutumärgid), which are on codepoints only present in ISO-8859-13,
to tell the two encodings apart.
2016-09-27 00:14:29 +02:00
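To illustrate how close the two Baltic encodings are, here is a minimal sketch that lists the byte values decoding differently under each. It uses Python's standard `iso8859_13` and `cp1257` codecs as an assumption about what "the encodings" map to; it is not part of uchardet, and the helper name `differing_bytes` is hypothetical.

```python
def differing_bytes(enc_a, enc_b):
    """Return (byte value, char in enc_a, char in enc_b) for every
    high byte that the two single-byte encodings decode differently."""
    diffs = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        # Undefined bytes become U+FFFD instead of raising an error.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append((value, a, b))
    return diffs


diffs = differing_bytes("iso8859_13", "cp1257")
print(f"{len(diffs)} high bytes differ out of 128")

# The low double quotation mark sits on a different byte in each encoding,
# which is the kind of difference a test text can exploit.
low_quote = "\u201e"
print(low_quote.encode("iso8859_13"), low_quote.encode("cp1257"))
```

A test text that contains none of the differing bytes is byte-for-byte identical in both encodings, so no detector could separate them on that input.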
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
05ba8555cd src: fix number of Single-Byte charset probers. 2016-09-25 14:02:39 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were
added, since the other encodings end up identical to ISO-8859-1 for most
common texts (i.e. the glyphs used in Italian are on the same codepoints
in these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
Jehan
6bbe7da1ac LangModels: add Finnish support.
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encodings end up the same (same
codepoints for the Finnish glyphs), so I keep only tests for ISO-8859-1
and UTF-8. Models for the other encodings may still be useful when
processing texts with some symbols, etc.
2016-09-21 18:27:39 +02:00
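The claim that Finnish text encodes to the same bytes in all six encodings can be checked with a short sketch. The sample sentence and the choice of Python's standard codec names are assumptions made for illustration only; nothing here comes from uchardet's test suite.

```python
ENCODINGS = [
    "iso8859_1", "iso8859_4", "iso8859_9",
    "iso8859_13", "iso8859_15", "cp1252",
]

# Hypothetical sample containing the Finnish letters ä, ö and Å.
sample = "Hyvää päivää, Åbo öljy"

encoded = {name: sample.encode(name) for name in ENCODINGS}
distinct = set(encoded.values())
print(f"{len(distinct)} distinct byte sequence(s) for {len(ENCODINGS)} encodings")

# Symbols are where the encodings diverge: the euro sign, for instance,
# does not sit on the same byte in ISO-8859-15 and Windows-1252.
euro_latin9 = "€".encode("iso8859_15")
euro_cp1252 = "€".encode("cp1252")
print(euro_latin9, euro_cp1252)
```

When every encoding yields the same bytes, a per-encoding test file adds nothing, which matches the decision to keep only the ISO-8859-1 and UTF-8 tests.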
Jehan
a59b1c9571 src: update documentation comments on the public API. 2016-09-21 17:36:17 +02:00
Jehan
3401ac70d0 LangModels: add Polish support.
With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16,
Windows-1250, IBM852, MAC-CENTRALEUROPE.
Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska
2016-09-21 17:30:15 +02:00
Jehan
5f9ec3aef0 LangModels: add support for Slovak.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope) since the resource I found indicates they used the
same encodings historically.
Note also that the encodings of the test examples were already
properly detected through the Czech models, so the languages are
definitely very close, even statistically. Nevertheless adding the right
models works better and these get better scores. This will become fully
meaningful once uchardet is also used as a language detector (in some
not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
2016-09-21 13:42:20 +02:00
Jehan
26e1cebad1 LangModels: add support for Czech.
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, whose contents were released into the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
2016-09-21 03:33:50 +02:00
Jehan
183092d048 src: fix non-guarded 'if' warning.
Not sure if it is useful to have the 'if (mDetectedCharset)' outside
the if block, but it won't hurt for sure in this specific case, so I
leave the current code logic as is.
The exact warning was:
nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’:
nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
     if (aLen > 2)
     ^~
nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
       if (mDetectedCharset)
       ^~
2016-09-21 02:37:31 +02:00
Jehan
2700cf3a83 LangModels: support for Maltese / ISO-8859-3.
Test text from https://mt.wikipedia.org/wiki/Franza.
2016-09-21 02:11:31 +02:00
Jehan
b7aebfdfda LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10.
Just realized that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway, the models are now updated and can recognize texts
using these encodings for these languages.
Added some test files as well, which work great.
2016-09-21 00:27:16 +02:00
Jehan
e138839f07 LangModels: add support for Portuguese / ISO-8859-1.
I actually also added pairs with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless there are no differences on the main
characters related to Portuguese, so these encodings can hardly be told
apart and detection will usually return ISO-8859-1 only.
2016-09-21 00:01:07 +02:00
Jehan
ea2f4dd40f LangModels: new support for Latvian / ISO-8859-13.
Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs
2016-09-20 23:29:53 +02:00
Jehan
7cb3dd9ddd LangModels: add support for Lithuanian / ISO-8859-13.
Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.
2016-09-20 23:09:24 +02:00
Jehan
157de1dc65 src: the EUC-KR prober now returns "UHC" as encoding name.
"UHC" is the "Unified Hangul Code" (aka Windows-949 or CP949). It is
apparently "mostly" upward compatible with EUC-KR so returning UHC for
a strict EUC-KR document is usually not to be considered wrong.
Yet I have read that EUC-KR has its own way of representing hangul
syllables not available in precomposed form, and this is not supported
in UHC (since the latter has all possible precomposed syllables), hence
the "mostly" upward-compatibility.
My personal daily experience with Korean documents though is that I
encounter a lot of UHC-encoded files, probably because of the
predominance of Microsoft operating systems, which spread this encoding.
So until we get 2 separate detection machines, let's just return EUC-KR
files as being "UHC".
2016-09-19 01:22:45 +02:00
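The "mostly upward compatible" relationship can be illustrated with Python's standard `euc_kr` and `cp949` codecs, taken here as stand-ins for EUC-KR and UHC. This is a minimal sketch with hypothetical sample strings, not a description of how the prober works.

```python
# Common precomposed syllables: the same bytes are valid in both encodings.
text = "한국어 위키백과"
euc_kr_bytes = text.encode("euc_kr")
roundtrip = euc_kr_bytes.decode("cp949")
print(roundtrip == text)

# UHC extends the repertoire to every precomposed hangul syllable,
# each stored on two bytes. This syllable is outside the common set.
rare_syllable = "똠"
uhc_bytes = rare_syllable.encode("cp949")
print(len(uhc_bytes))
```

A consumer that is told "UHC" for a plain EUC-KR file therefore decodes it correctly in the common case, which is the trade-off this commit accepts.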
Jehan
771d78b7df Update the URL links: uchardet is now a freedesktop project. 2016-07-20 01:47:50 +02:00
Jehan
210e52d99a LangModels: update the Greek language models.
I did this to improve the model after a user reported a badly detected
Greek subtitle (see commit e0eec3b).
It didn't help, but since the model is now built from much more
Wikipedia data, let's just commit it!
2016-05-25 17:39:10 +02:00
Jehan
e0eec3bae8 src: give a little weight to "probable sequences".
Up to now, we were only considering positive sequences, which are
the sequences of 2 characters which happen the most. Yet our data
gathers 4 categories of sequences (the last one being called "negative",
since they never happened in our data).
I will call the category just below positive "probable sequences". They
may happen, yet not often. The remaining category could be called
"neutral".
This seems to fix the detection of a user's subtitle example without
breaking any of our current unit tests.
I should probably still review this whole logic in more detail later.
2016-05-25 17:38:20 +02:00
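The idea of giving "a little weight" to probable sequences can be sketched as follows. This is an illustration only, not uchardet's actual confidence formula: the function name `sequence_score`, the category labels and the weight of 0.25 are all assumptions.

```python
from collections import Counter

# The four categories named in the commit message, from most to least frequent.
POSITIVE, PROBABLE, NEUTRAL, NEGATIVE = "positive", "probable", "neutral", "negative"

# Hypothetical weight: probable sequences count for a fraction of a positive one.
PROBABLE_WEIGHT = 0.25


def sequence_score(categories, probable_weight=PROBABLE_WEIGHT):
    """Score a text from the category of each 2-character sequence seen.

    Positive sequences count fully, probable ones partially, and
    neutral or negative ones only enlarge the denominator.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts[POSITIVE] + probable_weight * counts[PROBABLE]) / total


seen = [POSITIVE, PROBABLE, NEUTRAL, NEGATIVE]
print(sequence_score(seen))                       # probable sequences weighted
print(sequence_score(seen, probable_weight=0.0))  # positive sequences only
```

With a weight of zero the score falls back to counting positive sequences alone, which corresponds to the behaviour described as "up to now".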
Jehan
4287d3accc src: trailing whitespace removed. 2016-05-25 16:07:17 +02:00
Ilya Tumaykin
2a3e41a6c3
cmake: drop useless PACKAGE_NAME redefinition 2016-03-22 01:23:06 +03:00
Ilya Tumaykin
6db8b6f8fe
cmake: minor comment cleanups 2016-03-22 01:23:06 +03:00
Ilya Tumaykin
d0e7ddd8ab
cmake: fix library filename and SONAME
Make library filename respect the current uchardet version and
make library SONAME respect the current major version.
2016-03-22 01:23:05 +03:00
Ilya Tumaykin
ad647d2e0a
cmake: keep compiler definitions in one place 2016-03-22 01:23:05 +03:00
Ilya Tumaykin
29f18210b1
cmake: hardcode less 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
7201835c98
cmake: export UCHARDET_LIBRARY to the topmost scope 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
e7feb35627
cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
1a1f4bfbd8
cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:03 +03:00
Ilya Tumaykin
31a53570d6
cmake: use GNUInstallDirs cmake module
Available in cmake >= 2.8.5.
2016-03-22 01:23:03 +03:00
Ilya Tumaykin
b44be77be6
cmake: uniform indent everywhere
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)
78b55ec9fe CMake: Fix regression in f53cb8c building in paths with spaces
Tested with Ninja and Make in Windows and Archlinux with paths
with and without spaces.
2016-03-18 03:37:12 +00:00
Jehan
fcc525a64f Merge pull request #25 from Coacher/master
cmake: purge remnants of opencc after b6d872bb
2016-03-17 19:10:39 +01:00
Jehan
d255184609 Merge pull request #24 from wiiaboo/ab-suite
Improving build with more options.

Static-only builds are now possible, the uchardet command line tool build can be disabled, bindir can be customized…
2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)
86755b1f57 CMake: Don't build static more than once 2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)
b908b689a0 CMake: Add static lib destination to UCHARDET_TARGET 2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)
81ed86a26b CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN
This way it always shows up in ccmake, even if not defined.

A string is used instead of path because I personally think it makes more
sense in the following use-cases:

STRING:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
installs everything to /home/user/{lib,etc,share,(...)} and executables to
${CMAKE_INSTALL_PREFIX}/bins

-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
everything to /home/user/{lib,etc,share,(...)} and executables to
/opt/bin

PATH:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
everything to /home/user/{lib,etc,share,(...)} and executables to
$(pwd)/bins (!)
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
same as STRING
2016-03-16 19:11:33 +00:00
Ilya Tumaykin
aa4c2aeada
cmake: purge remnants of opencc after b6d872bb 2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)
50b2e0802f CMake: Allow not building executable 2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)
6500f09931 CMake: Allow building static-only builds
Add stdc++ to static libs in pkg-config
2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)
f53cb8cddd CMake: fix linking with Ninja 2016-03-16 14:17:47 +00:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
The test for ISO-8859-1 is disabled for now since, for the characters
used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not
big enough. Therefore the first to be declared "wins".
Let's look into improving this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
Jehan
248d6dbd35 tools: exit with non-zero value on uchardet error. 2016-01-21 18:16:42 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
a251753db8 LangModels: updating Hungarian language models. 2015-12-12 18:06:17 +01:00
Jehan
4c8316f9cf Nearly-ASCII text with NBSP is still not ASCII.
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according
to code comments), probably tied to the specificity of the original
program from Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only
exception" (note that I could have returned any ISO-8859 charsets since
they all have this character in common).
2015-12-05 21:11:29 +01:00
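Why returning "ASCII" for such a text causes trouble downstream can be shown with a minimal sketch. The helper `label_nearly_ascii` is hypothetical and only mirrors the rule stated in the commit message; it is not uchardet's implementation.

```python
NBSP = 0xA0  # non-breaking space in every ISO-8859 charset


def label_nearly_ascii(data: bytes):
    """Return a charset label for the two trivial cases, else None.

    Pure 7-bit data is ASCII. Data whose only high byte is 0xA0 is
    labelled ISO-8859-1. Anything else is left to real probers.
    """
    high_bytes = {b for b in data if b >= 0x80}
    if not high_bytes:
        return "ASCII"
    if high_bytes == {NBSP}:
        return "ISO-8859-1"
    return None


sample = b"price:\xa0100 EUR"

# A strict ASCII decoder rejects the byte, just as iconv would.
try:
    sample.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

label = label_nearly_ascii(sample)
decoded = sample.decode(label)
print(ascii_ok, label, repr(decoded))
```

Any consumer that trusts the returned label can now decode the text without error, which is the point of strict detection.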