uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-05 17:30:09 +08:00

Author	SHA1	Message	Date
Ilya Tumaykin	dbeee08335	cmake: use lowercase suffix for debug build	2016-03-22 01:23:05 +03:00
Ilya Tumaykin	ad647d2e0a	cmake: keep compiler definitions in one place	2016-03-22 01:23:05 +03:00
Ilya Tumaykin	29f18210b1	cmake: hardcode less	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	7201835c98	cmake: export UCHARDET_LIBRARY to the topmost scope	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	e7feb35627	cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	1a1f4bfbd8	cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity	2016-03-22 01:23:03 +03:00
Ilya Tumaykin	31a53570d6	cmake: use GNUInstallDirs cmake module Available in cmake >= 2.8.5.	2016-03-22 01:23:03 +03:00
Ilya Tumaykin	d0e29dc934	cmake: bump the minimum version to 2.8.5 Required for the GNUInstallDirs cmake module. See the next commit.	2016-03-22 01:21:58 +03:00
Jehan	ad7db2769e	Merge pull request #26 from Coacher/uniform-indent cmake: uniform indent everywhere.	2016-03-21 00:22:19 +01:00
Ilya Tumaykin	b44be77be6	cmake: uniform indent everywhere Indent with tabs, remove leading/trailing blank lines and spaces.	2016-03-21 01:07:41 +03:00
Jehan	b88a66f3f1	Merge pull request #28 from Coacher/cmake-updates cmake: use PACKAGE_NAME variable instead of hardcoding it.	2016-03-19 14:24:52 +01:00
Carbo Kuo	e28dfe3776	Merge pull request #29 from wiiaboo/ab-suite CMake: Fix regression in f53cb8c building in paths with spaces	2016-03-18 16:31:31 +01:00
Ricardo Constantino (:RiCON)	78b55ec9fe	CMake: Fix regression in f53cb8c building in paths with spaces Tested with Ninja and Make in Windows and Archlinux with paths with and without spaces.	2016-03-18 03:37:12 +00:00
Ilya Tumaykin	6c1e310f9b	cmake: hardcode less	2016-03-18 02:56:21 +03:00
Jehan	fcc525a64f	Merge pull request #25 from Coacher/master cmake: purge remnants of opencc after b6d872bb	2016-03-17 19:10:39 +01:00
Jehan	d255184609	Merge pull request #24 from wiiaboo/ab-suite Improving build with more options. Building only static possible, uchardet command line tool build can be disabled, bindir can be customized…	2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)	86755b1f57	CMake: Don't build static more than once	2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)	b908b689a0	CMake: Add static lib destination to UCHARDET_TARGET	2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)	81ed86a26b	CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN. This way it always shows up in ccmake, even if not defined. A string is used instead of a path because I personally think it makes more sense in the following use cases. STRING: (1) -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins installs everything to /home/user/{lib,etc,share,(...)} and executables to ${CMAKE_INSTALL_PREFIX}/bins; (2) -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin installs everything to /home/user/{lib,etc,share,(...)} and executables to /opt/bin. PATH: (1) -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins installs everything to /home/user/{lib,etc,share,(...)} and executables to $(pwd)/bins (!); (2) -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin behaves the same as STRING.	2016-03-16 19:11:33 +00:00
Ilya Tumaykin	aa4c2aeada	cmake: purge remnants of opencc after b6d872bb	2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)	50b2e0802f	CMake: Allow not building executable	2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)	6500f09931	CMake: Allow building static-only builds Add stdc++ to static libs in pkg-config	2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)	f53cb8cddd	CMake: fix linking with Ninja	2016-03-16 14:17:47 +00:00
Ricardo Constantino (:RiCON)	36665da832	CMake: allow installing binary to non-default dir	2016-03-16 14:17:25 +00:00
Jehan	198190461e	script: move the Wikipedia title syntax cleaning to BuildLangModel.py.	2016-02-21 16:20:22 +01:00
Jehan	d24bd7d578	script: Wikipedia API's python wrapper does not return garbage text anymore. I can't see new commits since 2014. So I am assuming the issue was on Wikipedia side and that it has been fixed.	2016-02-21 16:07:10 +01:00
Jehan	37024460fe	script: add a README file dedicated to adding new support.	2016-02-21 16:06:11 +01:00
Jehan	42c6b42f65	Add a DOAP file. All URLs are still referring to the github project, because we have no other homepage or bug tracker yet.	2016-02-21 15:19:50 +01:00
Jehan	d5dba26e04	README: add Danish support for 3 charsets.	2016-02-19 19:11:56 +01:00
Jehan	923d264470	LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). Test for ISO-8859-1 is disabled for now since the difference is not big enough, as for characters used in Danish, between ISO-8859-1 and ISO-8859-15. Therefore the first to be declared "wins". Let's see to improve this later. Test contents from: https://da.wikipedia.org/wiki/Eurosymbol https://da.wikipedia.org/wiki/Dansk_%28sprog%29	2016-02-19 19:10:41 +01:00
Jehan	1694999bce	README: update with VISCII support.	2016-02-13 03:52:06 +01:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	600cf76a76	BuildLangModel: try using iconv for conversion when support missing... ... in python. For instance I had the case where the VISCII encoding is supported by iconv but not by encode/decode() function in core python.	2016-02-13 03:47:41 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	27135a8880	BuildLangModel: printing a message when discarding a page.	2016-02-13 02:27:15 +01:00
Jehan	0446e24c8d	README: uchardet now available on Fedora. Already in Fedora devel and soon to be added as update on Fedora 23, if I get it correctly. See: https://bugzilla.redhat.com/show_bug.cgi?id=1264713 https://admin.fedoraproject.org/pkgdb/package/rpms/uchardet/	2016-02-12 17:53:22 +01:00
Jehan	248d6dbd35	tools: exit with non-zero value on uchardet error.	2016-01-21 18:16:42 +01:00
Jehan	b6d872bbec	app: package name wrong in CMakeLists.txt. Probably coming from a copy-paste error when the build system was originally created.	2015-12-15 21:40:16 +01:00
Jehan	706023139c	tests: add test files for Arabic. Text taken from: https://ar.wikipedia.org/wiki/%D9%88%D9%8A%D9%86%D8%AF%D9%88%D8%B2-1256	2015-12-13 18:42:59 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ad2f7212e2	LangModels: retraining Greek models with my training script. This fixes our Greek/Windows-1253 test.	2015-12-13 18:02:11 +01:00
Jehan	1b4c62ac21	tests: test files for Spanish. I disable only ISO-8859-15 which is similar to ISO-8859-1 for all Spanish letters. Unfortunately illegal codepoints are similar too. Difference should likely be done on symbols (like the euro symbol) but our current algorithm does nothing about this for charset comparison. Text from https://es.wikipedia.org/wiki/España	2015-12-12 18:55:43 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	055332ac7d	BuildLangModel: allow the alphabet list to be written in string format.	2015-12-12 18:50:29 +01:00
Jehan	6b2722885a	BuildLangModel: forgot to add charset/language files.	2015-12-12 18:18:08 +01:00
Jehan	2bade77bf9	tests: update Windows-1250 test file for Hungarian. ISO-8859-2 and Windows-1250 are absolutely similar for all letters in the Hungarian alphabet. So for most texts, it is not an error to return one charset or the other. What could make the difference is for instance that Windows-1250 has some symbols where ISO-8859-2 has control characters, like quotes, dashes, the euro symbol… Since control characters have a negative impact on confidence now, texts with such symbols would tend towards Windows-1250 decision. The new test file has such quote symbols.	2015-12-12 18:12:08 +01:00
Jehan	a251753db8	LangModels: updating Hungarian language models.	2015-12-12 18:06:17 +01:00
Jehan	7b4eb9827e	BuildLangModel: add an exception handler on charset spec errors.	2015-12-12 18:00:30 +01:00
Jehan	4c8316f9cf	Nearly-ASCII text with NBSP is still not ASCII. There is no "exception" in encoding. The non-breaking space 0xA0 is not ASCII, and therefore returning "ASCII" will later create issues (for instance trying to re-encode with iconv produces an error). This was obviously an explicit decision in original code (according to code comments), probably tied to specificity of the original program from Mozilla. Now we want strict detection. I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only exception" (note that I could have returned any ISO-8859 charsets since they all have this character in common).	2015-12-05 21:11:29 +01:00
Jehan	886e03a523	Release: version 0.0.5. v0.0.5	2015-12-04 22:45:26 +01:00

176 Commits