Jehan
e0eec3bae8
src: give a little weight to "probable sequences".
...
Up to now, we were only considering positive sequences, which are
the sequences of 2 characters that happen the most. Yet our data has
4 categories of sequences (the lowest one being called "negative", since
these sequences never happened in our data).
I will call the category just below positive "probable sequences". They
may happen, yet not often. The remaining category could be called "neutral".
This seems to fix the detection of a user's subtitle example without
breaking any of our current unit tests.
I should probably still review this whole logic in more detail later.
2016-05-25 17:38:20 +02:00
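The weighting described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code of the detector: the four category names come from the commit message, while the 0.25 weight, the `category_of` mapping, and the function name `sequence_score` are assumptions made for the example.

```python
from collections import Counter

# Category labels follow the commit message; the numeric weight is an
# illustrative assumption, not the value used in uchardet.
POSITIVE, PROBABLE, NEUTRAL, NEGATIVE = "positive", "probable", "neutral", "negative"
PROBABLE_WEIGHT = 0.25  # hypothetical "little weight"


def sequence_score(pairs, category_of, probable_weight=PROBABLE_WEIGHT):
    """Return a score in [0, 1] for a list of 2-character sequences.

    `category_of` maps a pair to one of the four categories; pairs missing
    from the mapping are treated as negative (never seen in the data).
    """
    counts = Counter(category_of.get(pair, NEGATIVE) for pair in pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Positive sequences count fully, probable ones only partially;
    # neutral and negative sequences add nothing to the score.
    weighted = counts[POSITIVE] + probable_weight * counts[PROBABLE]
    return weighted / total
```

With `probable_weight=0.0` the function reduces to the earlier behaviour of counting positive sequences only, so the effect of the change can be compared on the same input.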
Jehan
4287d3accc
src: trailing whitespace removed.
2016-05-25 16:07:17 +02:00
Jehan
6cd8c322ad
script: stupid bug on BuildLangModel.py.
2016-05-25 15:23:36 +02:00
Jehan
fb1d544007
pkg-config: use GNUInstallDirs CMAKE_ variables in pc.in template.
2016-03-27 20:31:58 +02:00
Jehan
74b4f6a62b
Merge pull request #30 from Coacher/use-gnuinstalldirs-cmake-module
...
Use GNUInstallDirs cmake module, fix library filename bug, minor cleanups.
2016-03-27 20:31:17 +02:00
Ilya Tumaykin
2a3e41a6c3
cmake: drop useless PACKAGE_NAME redefinition
2016-03-22 01:23:06 +03:00
Ilya Tumaykin
6db8b6f8fe
cmake: minor comment cleanups
2016-03-22 01:23:06 +03:00
Ilya Tumaykin
d0e7ddd8ab
cmake: fix library filename and SONAME
...
Make library filename respect the current uchardet version and
make library SONAME respect the current major version.
2016-03-22 01:23:05 +03:00
Ilya Tumaykin
dbeee08335
cmake: use lowercase suffix for debug build
2016-03-22 01:23:05 +03:00
Ilya Tumaykin
ad647d2e0a
cmake: keep compiler definitions in one place
2016-03-22 01:23:05 +03:00
Ilya Tumaykin
29f18210b1
cmake: hardcode less
2016-03-22 01:23:04 +03:00
Ilya Tumaykin
7201835c98
cmake: export UCHARDET_LIBRARY to the topmost scope
2016-03-22 01:23:04 +03:00
Ilya Tumaykin
e7feb35627
cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity
2016-03-22 01:23:04 +03:00
Ilya Tumaykin
1a1f4bfbd8
cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity
2016-03-22 01:23:03 +03:00
Ilya Tumaykin
31a53570d6
cmake: use GNUInstallDirs cmake module
...
Available in cmake >= 2.8.5.
2016-03-22 01:23:03 +03:00
Ilya Tumaykin
d0e29dc934
cmake: bump the minimum version to 2.8.5
...
Required for the GNUInstallDirs cmake module. See the next commit.
2016-03-22 01:21:58 +03:00
Jehan
ad7db2769e
Merge pull request #26 from Coacher/uniform-indent
...
cmake: uniform indent everywhere.
2016-03-21 00:22:19 +01:00
Ilya Tumaykin
b44be77be6
cmake: uniform indent everywhere
...
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Jehan
b88a66f3f1
Merge pull request #28 from Coacher/cmake-updates
...
cmake: use PACKAGE_NAME variable instead of hardcoding it.
2016-03-19 14:24:52 +01:00
Carbo Kuo
e28dfe3776
Merge pull request #29 from wiiaboo/ab-suite
...
CMake: Fix regression in f53cb8c building in paths with spaces
2016-03-18 16:31:31 +01:00
Ricardo Constantino (:RiCON)
78b55ec9fe
CMake: Fix regression in f53cb8c building in paths with spaces
...
Tested with Ninja and Make in Windows and Archlinux with paths
with and without spaces.
2016-03-18 03:37:12 +00:00
Ilya Tumaykin
6c1e310f9b
cmake: hardcode less
2016-03-18 02:56:21 +03:00
Jehan
fcc525a64f
Merge pull request #25 from Coacher/master
...
cmake: purge remnants of opencc after b6d872bb
2016-03-17 19:10:39 +01:00
Jehan
d255184609
Merge pull request #24 from wiiaboo/ab-suite
...
Improving build with more options.
Static-only builds are now possible, building the uchardet command-line tool can be disabled, bindir can be customized…
2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)
86755b1f57
CMake: Don't build static more than once
2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)
b908b689a0
CMake: Add static lib destination to UCHARDET_TARGET
2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)
81ed86a26b
CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN
...
This way it always shows up in ccmake, even if not defined.
A string is used instead of a path because I personally think it makes more
sense in the following use cases:
STRING:
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
    installs everything to /home/user/{lib,etc,share,(...)} and executables to
    ${CMAKE_INSTALL_PREFIX}/bins
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
    installs everything to /home/user/{lib,etc,share,(...)} and executables to
    /opt/bin
PATH:
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
    installs everything to /home/user/{lib,etc,share,(...)} and executables to
    $(pwd)/bins (!)
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
    same as STRING
2016-03-16 19:11:33 +00:00
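The four cases listed in the commit above can be modelled with a small Python sketch. It only reproduces the behaviour described in the message and is not CMake itself; the function name `resolve_bindir` and the `cwd` parameter (standing in for `$(pwd)`) are assumptions made for the example.

```python
import posixpath


def resolve_bindir(prefix, bindir, cache_type="STRING", cwd="/build"):
    """Model where executables end up for a given CMAKE_INSTALL_BINDIR.

    Simplified model of the cases listed in the commit message.
    """
    if posixpath.isabs(bindir):
        # Absolute values are used as-is for both cache types.
        return bindir
    if cache_type == "PATH":
        # A relative PATH value is made absolute against the directory
        # cmake was invoked from, not against the install prefix.
        return posixpath.join(cwd, bindir)
    # A relative STRING value stays relative to the install prefix.
    return posixpath.join(prefix, bindir)
```

For example, `resolve_bindir("/home/user", "bins")` gives a directory under the prefix, while passing `cache_type="PATH"` gives one under the invocation directory, which is the surprising case flagged with "(!)".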
Ilya Tumaykin
aa4c2aeada
cmake: purge remnants of opencc after b6d872bb
2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)
50b2e0802f
CMake: Allow not building executable
2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)
6500f09931
CMake: Allow building static-only builds
...
Add stdc++ to static libs in pkg-config
2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)
f53cb8cddd
CMake: fix linking with Ninja
2016-03-16 14:17:47 +00:00
Ricardo Constantino (:RiCON)
36665da832
CMake: allow installing binary to non-default dir
2016-03-16 14:17:25 +00:00
Jehan
198190461e
script: move the Wikipedia title syntax cleaning to BuildLangModel.py.
2016-02-21 16:20:22 +01:00
Jehan
d24bd7d578
script: Wikipedia API's python wrapper does not return garbage text anymore.
...
I can't see any new commits since 2014, so I am assuming the issue was on
Wikipedia's side and that it has been fixed.
2016-02-21 16:07:10 +01:00
Jehan
37024460fe
script: add a README file dedicated to adding new support.
2016-02-21 16:06:11 +01:00
Jehan
42c6b42f65
Add a DOAP file.
...
All URLs are still referring to the github project, because we have
no other homepage or bug tracker yet.
2016-02-21 15:19:50 +01:00
Jehan
d5dba26e04
README: add Danish support for 3 charsets.
2016-02-19 19:11:56 +01:00
Jehan
923d264470
LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
...
The test for ISO-8859-1 is disabled for now, since ISO-8859-1 and
ISO-8859-15 do not differ enough for the characters used in Danish.
Therefore the first one to be declared "wins".
Let's see about improving this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
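The ambiguity mentioned in the commit above can be checked with Python's built-in codecs. This is a small sketch, unrelated to the detector's own code; the helper name `differing_chars` and the sample strings are assumptions made for the example.

```python
def differing_chars(text, charset_a="iso-8859-1", charset_b="iso-8859-15"):
    """Return the characters of `text` whose encoding differs between charsets.

    A character that one of the charsets cannot encode at all counts as
    differing.
    """
    differing = []
    for char in sorted(set(text)):
        encoded = []
        for charset in (charset_a, charset_b):
            try:
                encoded.append(char.encode(charset))
            except UnicodeEncodeError:
                encoded.append(None)
        if encoded[0] != encoded[1]:
            differing.append(char)
    return differing
```

Danish letters such as æ, ø and å yield an empty list, whereas a string containing the euro sign reports that character, since ISO-8859-1 cannot encode it at all.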
Jehan
1694999bce
README: update with VISCII support.
2016-02-13 03:52:06 +01:00
Jehan
98b5e52252
LangModels: add VISCII encoding support and retrain Vietnamese model.
2016-02-13 03:51:18 +01:00
Jehan
600cf76a76
BuildLangModel: try using iconv for conversion when support missing...
...
... in Python. For instance, I had the case where the VISCII encoding is
supported by iconv but not by the encode()/decode() functions in core Python.
2016-02-13 03:47:41 +01:00
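The fallback described in the commit above could look roughly like this. It is a hedged sketch rather than the actual code of BuildLangModel.py: the function names `iconv_encode` and `encode_with_fallback` are hypothetical, and the sketch assumes an `iconv` executable is available on the PATH.

```python
import subprocess


def iconv_encode(text, charset):
    """Convert `text` to `charset` by piping UTF-8 through the iconv CLI."""
    result = subprocess.run(
        ["iconv", "-f", "UTF-8", "-t", charset],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if iconv fails
    )
    return result.stdout


def encode_with_fallback(text, charset, fallback=iconv_encode):
    """Encode with Python's codecs, falling back when the codec is unknown."""
    try:
        return text.encode(charset)
    except LookupError:
        # Python has no codec registered under this name; try the fallback.
        return fallback(text, charset)
```

Keeping the fallback as a parameter makes it possible to exercise the logic without iconv installed, by passing a stand-in converter.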
Jehan
178c6119b8
LangModels: add Windows-1258 support for Vietnamese.
...
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not have any support for it, so I cannot generate
the proper statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
Jehan
27135a8880
BuildLangModel: printing a message when discarding a page.
2016-02-13 02:27:15 +01:00
Jehan
0446e24c8d
README: uchardet now available on Fedora.
...
Already in Fedora devel and soon to be added as an update to Fedora 23,
if I understand correctly. See:
https://bugzilla.redhat.com/show_bug.cgi?id=1264713
https://admin.fedoraproject.org/pkgdb/package/rpms/uchardet/
2016-02-12 17:53:22 +01:00
Jehan
248d6dbd35
tools: exit with non-zero value on uchardet error.
2016-01-21 18:16:42 +01:00
Jehan
b6d872bbec
app: package name wrong in CMakeLists.txt.
...
Probably coming from a copy-paste error when the build system was
originally created.
2015-12-15 21:40:16 +01:00
Jehan
706023139c
tests: add test files for Arabic.
...
Text taken from:
https://ar.wikipedia.org/wiki/%D9%88%D9%8A%D9%86%D8%AF%D9%88%D8%B2-1256
2015-12-13 18:42:59 +01:00
Jehan
9c3c37517c
LangModels: add Arabic support.
...
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ad2f7212e2
LangModels: retraining Greek models with my training script.
...
This fixes our Greek/Windows-1253 test.
2015-12-13 18:02:11 +01:00
Jehan
1b4c62ac21
tests: test files for Spanish.
...
I only disable the ISO-8859-15 test, since this charset is similar to
ISO-8859-1 for all Spanish letters. Unfortunately the illegal codepoints
are similar too. The distinction should probably be made on symbols (like
the euro symbol), but our current algorithm does nothing about this when
comparing charsets.
Text from https://es.wikipedia.org/wiki/España
2015-12-12 18:55:43 +01:00