74 Commits

Author SHA1 Message Date
Ilya Tumaykin
ad647d2e0a
cmake: keep compiler definitions in one place 2016-03-22 01:23:05 +03:00
Ilya Tumaykin
29f18210b1
cmake: hardcode less 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
7201835c98
cmake: export UCHARDET_LIBRARY to the topmost scope 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
e7feb35627
cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:04 +03:00
Ilya Tumaykin
1a1f4bfbd8
cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:03 +03:00
Ilya Tumaykin
31a53570d6
cmake: use GNUInstallDirs cmake module
Available in cmake >= 2.8.5.
2016-03-22 01:23:03 +03:00
Ilya Tumaykin
b44be77be6
cmake: uniform indent everywhere
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)
78b55ec9fe CMake: Fix regression in f53cb8c building in paths with spaces
Tested with Ninja and Make in Windows and Archlinux with paths
with and without spaces.
2016-03-18 03:37:12 +00:00
Jehan
fcc525a64f Merge pull request #25 from Coacher/master
cmake: purge remnants of opencc after b6d872bb
2016-03-17 19:10:39 +01:00
Jehan
d255184609 Merge pull request #24 from wiiaboo/ab-suite
Improving build with more options.

Building only static possible, uchardet command line tool build can be disabled, bindir can be customized…
2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)
86755b1f57 CMake: Don't build static more than once 2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)
b908b689a0 CMake: Add static lib destination to UCHARDET_TARGET 2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)
81ed86a26b CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN
This way it always shows up in ccmake, even if not defined.

A string is used instead of path because I personally think it makes more
sense in the following use-cases:

STRING:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
installs everything to /home/user/{lib,etc,share,(...)} and executables to
${CMAKE_INSTALL_PREFIX}/bins

-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
everything to /home/user/{lib,etc,share,(...)} and executables to
/opt/bin

PATH:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
everything to /home/user/{lib,etc,share,(...)} and executables to
$(pwd)/bins (!)

-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
same as STRING
2016-03-16 19:11:33 +00:00
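The STRING versus PATH cases listed in commit 81ed86a26b can be modelled with a small sketch. This is not CMake code: the function name `resolve_bindir` and its parameters are hypothetical, and `cwd` stands in for the directory from which cmake was invoked. It only mirrors the four outcomes described in the message.

```python
import posixpath


def resolve_bindir(prefix, bindir, cache_type, cwd):
    """Model where executables end up, following the commit message.

    cache_type is "STRING" or "PATH" (hypothetical parameter naming the
    cache entry type); cwd is the directory cmake was run from.
    """
    if posixpath.isabs(bindir):
        # Absolute values behave the same for STRING and PATH.
        return bindir
    if cache_type == "PATH":
        # A relative PATH value is resolved against the current directory.
        return posixpath.join(cwd, bindir)
    # A relative STRING value stays relative and lands under the prefix.
    return posixpath.join(prefix, bindir)


for kind in ("STRING", "PATH"):
    for value in ("bins", "/opt/bin"):
        print(kind, value, "->", resolve_bindir("/home/user", value, kind, "/tmp/build"))
```

The relative PATH case is the surprising one flagged with "(!)" in the message: the result depends on where the command was run rather than on the install prefix.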
Ilya Tumaykin
aa4c2aeada
cmake: purge remnants of opencc after b6d872bb 2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)
50b2e0802f CMake: Allow not building executable 2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)
6500f09931 CMake: Allow building static-only builds
Add stdc++ to static libs in pkg-config
2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)
f53cb8cddd CMake: fix linking with Ninja 2016-03-16 14:17:47 +00:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
The test for ISO-8859-1 is disabled for now because, for the characters
used in Danish, ISO-8859-1 and ISO-8859-15 differ too little.
Therefore the first to be declared "wins".
This should be improved later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate
the proper statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
Jehan
248d6dbd35 tools: exit with non-zero value on uchardet error. 2016-01-21 18:16:42 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2 LangModels: retraining Greek models with my training script.
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
a251753db8 LangModels: updating Hungarian language models. 2015-12-12 18:06:17 +01:00
Jehan
4c8316f9cf Nearly-ASCII text with NBSP is still not ASCII.
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according to
code comments), probably tied to the specifics of the original program from
Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only
exception" (note that I could have returned any ISO-8859 charset since
they all have this character in common).
2015-12-05 21:11:29 +01:00
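The re-encoding problem described in commit 4c8316f9cf can be reproduced with Python's built-in codecs (used here in place of iconv, as an assumption for illustration). The sample bytes and the helper `is_strict_ascii` are made up for this sketch.

```python
# Sample input: ASCII text whose only non-ASCII byte is NBSP (0xA0).
text = b"plain text\xa0with a non-breaking space"


def is_strict_ascii(data: bytes) -> bool:
    """True only if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


# Trusting an "ASCII" label fails as soon as the NBSP byte is reached.
try:
    text.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

# The same bytes decode cleanly as ISO-8859-1; 0xA0 becomes U+00A0.
latin1 = text.decode("iso-8859-1")

print(is_strict_ascii(text), decoded_as_ascii, "\u00a0" in latin1)
```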
Jehan
e5234d6b61 Stating endianness of UTF-16 and UTF-32 was an error when BOM present.
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
It also broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE creates a text without a BOM, and a conversion
from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed
existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
2015-12-04 19:19:39 +01:00
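The behaviour change described in commit e5234d6b61 can be illustrated with Python's codecs, used here as a stand-in for whatever converter a calling program uses. The sample string is made up.

```python
import codecs

payload = "hi"

# Input text as a detector would see it: little-endian UTF-16 with a BOM.
with_bom = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")

# Labelled "UTF-16": the decoder consumes the BOM and returns the text only.
as_utf16 = with_bom.decode("utf-16")

# Labelled "UTF-16LE": the BOM is kept as a U+FEFF character in the text,
# which then survives into any later conversion (for example to UTF-8).
as_utf16le = with_bom.decode("utf-16-le")

# Encoding under an explicit-endian label never writes a BOM.
no_bom = payload.encode("utf-16-le")

print(repr(as_utf16), repr(as_utf16le), no_bom)
```

A round trip through an explicit-endian label therefore adds a stray U+FEFF in one direction and drops the BOM in the other, which matches the changed behaviour the message reports.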
Jehan
5691dc59a1 LangModels: rename Cyrillic models to Russian models.
Our language models are per-lang, not per script.
2015-12-04 03:27:29 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is identical to TIS-620 except for the added
non-breaking space character.
Our detection will therefore return TIS-620 unless a text
contains a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
55b4f23971 Single Byte charsets: high ctrl character ratio lowers confidence.
Control characters are not an error per se. Nevertheless they are clearly not
frequent in single-byte charset texts. It is only normal for them to lower
confidence in a charset. In particular a higher ctrl-per-letter ratio means
a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
2015-12-04 00:04:43 +01:00
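The idea in commit 55b4f23971 can be sketched as a scaling factor. The function name, its parameters and the exact formula are assumptions for illustration; the commit message only states that a higher control-character ratio means a lower confidence.

```python
def penalize_control_chars(confidence, total_chars, ctrl_chars):
    """Scale a confidence by the share of characters that are not controls.

    Hypothetical formula: the share of non-control characters is used
    directly as a multiplier. The real prober may weight it differently.
    """
    if total_chars == 0:
        return 0.0
    return confidence * (total_chars - ctrl_chars) / total_chars


# Same base confidence, different amounts of control characters.
print(penalize_control_chars(0.8, 100, 0))
print(penalize_control_chars(0.8, 100, 25))
```

With a penalty of this kind, two charsets that score equally on letters can be separated when some bytes map to control characters in one of them and to printable characters in the other.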
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
ea34e8b1bd Update doc comment.
We no longer return an empty string for ASCII. An empty string now means
detection failure only. ASCII gets a proper "ASCII" return value.
2015-12-03 20:36:09 +01:00
Jehan
ba56d91808 Update uchardet URL in various places. 2015-12-03 19:48:29 +01:00
Jehan
d1bc09e4d7 Update authors.
I think I deserved being listed in the authors by now. ;-)
2015-12-03 19:44:13 +01:00
Jehan
c4fa728e7a Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master
Let's shortcut Single Byte charset detection on invalid codepoints.
Merging and fixing the contributor's commit conflicts after the code
redesign: in particular, we added an illegal character concept (illegal
characters were mixed with control characters in the current charmaps, yet
ctrl characters are NOT to be considered invalid) and constants instead of
hardcoded numbers ('ILL' rather than 255).
2015-12-03 19:26:19 +01:00
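The shortcut described in merge c4fa728e7a can be sketched as follows. Only the name `ILL` and the value 255 come from the commit message; the `CTR` value, the toy charmap and the function names are hypothetical and do not reproduce uchardet's actual tables.

```python
ILL = 255  # byte has no assignment in this charset (value from the commit message)
CTR = 254  # control character: unusual but valid (value assumed for this sketch)


def build_toy_charmap():
    """Build a made-up 256-entry charmap with two illegal bytes."""
    charmap = [0] * 256
    for b in range(0x20):
        charmap[b] = CTR  # C0 controls are valid, just infrequent
    charmap[0x81] = ILL
    charmap[0x8D] = ILL
    return charmap


def still_candidate(data: bytes, charmap) -> bool:
    """Return False as soon as a byte is illegal in this charset."""
    for b in data:
        if charmap[b] == ILL:
            return False  # shortcut: no need to look at the rest of the input
    return True


toy = build_toy_charmap()
print(still_candidate(b"abc\x07", toy))  # control character only
print(still_candidate(b"ab\x81c", toy))  # contains an illegal byte
```

Keeping `ILL` and `CTR` distinct is the point of the redesign: a control character lowers plausibility at most, while an illegal byte rules the charset out.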
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. 2015-12-03 19:04:07 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
4f1c3ff85e nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.
If all sequences in a text are positive sequences, the ratio of positive
sequences cannot distinguish between two very close charsets.
A ratio of positive sequences per letter, on the other hand, will
break a tie between two encodings. If, while adding a letter, the number
of positive sequences does not increase, the confidence will decrease
(corresponding to the fact that it was likely not a letter).
On the other hand, if the number of positive sequences increases, so
will the confidence.
For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
2015-11-30 19:52:07 +01:00
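The tie-breaking effect described in commit 4f1c3ff85e can be sketched with made-up counts. The function, its parameter names and the cap at 1.0 are assumptions for illustration, not the prober's actual formula or thresholds.

```python
def confidence(positive_seqs, total_seqs, letters, typical_positive_ratio):
    """Hypothetical confidence: the old ratio times positive sequences per letter."""
    if total_seqs == 0 or letters == 0:
        return 0.0
    # Old factor: share of positive sequences, normalised by the model's
    # typical ratio. It is identical for both candidates below.
    old_factor = positive_seqs / total_seqs / typical_positive_ratio
    # New factor: positive sequences per letter breaks the tie.
    new_factor = positive_seqs / letters
    return min(old_factor * new_factor, 1.0)


# Made-up counts for one short word read under two close charsets.
# Candidate A maps every byte to a letter: 4 letters, 3 sequences, all positive.
# Candidate B maps one byte to a symbol: 3 letters, 1 sequence, positive.
a = confidence(positive_seqs=3, total_seqs=3, letters=4, typical_positive_ratio=0.98)
b = confidence(positive_seqs=1, total_seqs=1, letters=3, typical_positive_ratio=0.98)
print(a, b)
```

Both candidates have only positive sequences, so the first factor alone cannot separate them; the per-letter factor favours the charset in which the extra byte is a letter that forms sequences.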
Jehan
9cb5764b73 LangModels: update the French language models.
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
0289c2a232 Differentiate ASCII and detection failure.
The lib used to return "" for both properly detected ASCII and
detection failure. And the tool would return "ascii/unknown".
Make a proper distinction between the 2 cases.
2015-11-28 17:04:52 +01:00
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
2106173546 Move all Single-Byte language models to a subdirectory. 2015-11-27 23:11:23 +01:00
Jehan
984d8f7b09 Add language information in model names when they were missing.
Models are language specific (there could be several models for the same
charset but different languages). Let's have a clear naming scheme.
2015-11-27 18:21:13 +01:00
Jehan
42b91898da Create 3-letter constants for special charmap characters.
Control characters, carriage returns, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
2015-11-27 17:41:54 +01:00
Ophir LOJKINE
5ef60164fc Stop detection early on control characters 2015-11-24 22:07:41 +03:00
Jehan
e8dd55995a Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info...
... and add UTF-32 BOM detection.
2015-11-24 18:50:23 +01:00
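BOM-based detection for UTF-16 and UTF-32, as introduced in commit e8dd55995a, can be sketched with the BOM constants from Python's `codecs` module. The function name `sniff_bom` and the tuple return shape are hypothetical; how the result is turned into a charset label is left out.

```python
import codecs

# UTF-32 is tested first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so the longer pattern must win.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32", "LE"),
    (codecs.BOM_UTF32_BE, "UTF-32", "BE"),
    (codecs.BOM_UTF16_LE, "UTF-16", "LE"),
    (codecs.BOM_UTF16_BE, "UTF-16", "BE"),
]


def sniff_bom(data: bytes):
    """Return (family, byte_order) if data starts with a known BOM, else None."""
    for bom, family, order in _BOMS:
        if data.startswith(bom):
            return family, order
    return None


print(sniff_bom(codecs.BOM_UTF32_LE + "a".encode("utf-32-le")))
print(sniff_bom(codecs.BOM_UTF16_BE + "a".encode("utf-16-be")))
print(sniff_bom(b"abc"))
```

One known ambiguity remains in a sketch like this: UTF-16LE text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32LE BOM.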