384 Commits

Author SHA1 Message Date
Jehan
c218a3ccd6 README: add a section about CMake exported targets.
Since it's a new feature, we may as well write about it, even though I
would personally not recommend this in favor of more standard and
generic pkg-config (which is not dependent on which build system we are
using ourselves).
2022-11-30 23:48:16 +01:00
Jehan
6196f86c46 README: update with newly added (lang, charset) couples. 2022-11-30 20:06:52 +01:00
Jehan
388777be51 script, src, test: add IBM865 support for Danish.
The newly added IBM865 charset (for Norwegian) can also be used for Danish.

By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the 'da'
ISO 639-1 code, not 'dk' (which is sometimes used for other codes for
Denmark, such as the ISO 3166 country code and the internet TLD, but not
for the language itself).

For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
2022-11-30 19:57:52 +01:00
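The claim that IBM865 covers Danish can be checked outside of uchardet. The snippet below is only a sketch: it relies on Python's standard `cp865` codec as a stand-in for the charset, and the sample sentence and the helper name `roundtrip_ibm865` are made up for illustration, not taken from the test suite.

```python
# Illustration only: Python's standard "cp865" codec stands in for the
# IBM865 charset here; uchardet itself is not involved.
DANISH_SAMPLE = "Rødgrød med fløde på æblekage"  # hypothetical sample text


def roundtrip_ibm865(text: str) -> bytes:
    """Encode text as IBM865 and check that decoding restores it."""
    raw = text.encode("cp865")
    if raw.decode("cp865") != text:
        raise ValueError("text does not survive an IBM865 round trip")
    return raw


if __name__ == "__main__":
    encoded = roundtrip_ibm865(DANISH_SAMPLE)
    # IBM865 is a single-byte charset, so byte and character counts match.
    print(len(encoded), len(DANISH_SAMPLE))
```

The Danish-specific letters (æ, ø, å) all encode without error, which is what makes the charset a candidate for Danish text as well as Norwegian.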
Jehan
5aa628272b script: fix small issues with commits e41e8a4 and 8d15d6b. 2022-11-30 19:24:28 +01:00
Martin T. H. Sandsmark
c11c362b89 Add tests for norwegian 2022-11-30 19:09:21 +01:00
Martin T. H. Sandsmark
099a9a4fd6 Add norwegian support 2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark
e41e8a47e4 improve model building script a bit 2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark
8d15d6b557 make the logfile usable 2022-11-30 19:09:09 +01:00
Jehan
2a04e57c8f test: update the Maltese / ISO-8859-3 test file.
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end, i.e. if there
are enough actual Maltese words, the stats should still weigh in favor
of Maltese likeness (which they mostly did anyway), but since I am
making some other changes, this was just not enough. In particular I was
changing some of the UTF-8 confidence logic and the file ended up
detected as UTF-8 (even though it has illegal sequences and cannot be!
Cf. #9).

So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have definite
non-questionable Maltese in there to simplify testing at this early
stage of uchardet rewriting.
2022-11-29 14:59:17 +01:00
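To illustrate why an ISO-8859-3 file with Maltese letters cannot be valid UTF-8, here is a minimal sketch using Python's strict UTF-8 decoder. It is not uchardet's detector (which works on confidence scores); the function name `is_valid_utf8` and the sample word are assumptions chosen for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as UTF-8 with no illegal sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: a Maltese word containing the letter "ħ".
MALTESE_WORD = "Għawdex"

# In ISO-8859-3 the "ħ" becomes a single high byte followed by ASCII,
# which is never a legal UTF-8 sequence.
latin3_bytes = MALTESE_WORD.encode("iso8859_3")

if __name__ == "__main__":
    print(is_valid_utf8(latin3_bytes))
    print(is_valid_utf8(MALTESE_WORD.encode("utf-8")))
```

A hard validity check like this is one possible way to rule UTF-8 out early; whether the eventual detector fix takes this shape is not stated in the commit.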
Lucinda May Phipps
45bd32d102 src/tools/uchardet.cpp: make stuff static 2022-11-29 13:57:31 +00:00
Lucinda May Phipps
ef19faa8c5 Update uchardet-tests.c 2022-11-29 13:57:31 +00:00
Lucinda May Phipps
383bf118c9 don't use feof 2022-11-29 13:57:31 +00:00
myd7349
143b3fe513 README: update libchardet repository link 2022-08-01 19:38:19 +08:00
andiwand
23a664560b Issue #27: fix cmake 2021-12-01 13:49:37 +01:00
Jehan
b3b2bd2721 gitignore: I forgot the 2 executables (CLI tool and test binary). 2021-11-09 14:26:21 +01:00
Jehan
48db2b0800 gitignore: add files generated by the build system.
Though it is highly encouraged to do out-of-source builds, it is not
strictly forbidden to do in-source builds. So we should ignore the files
generated by CMake. Only tested with a Linux build, with both make and
ninja backends.

I added .dll and .dylib versions (for Windows and macOS respectively),
guessing these will be the file names on these platforms, unless
mistaken (since untested).

As discussed in !10, let's add with this commit files generated by the
build system, but not any personal environment files (specific to
contributors' environment).
If I missed any file name which can be generated by the build system in
some platforms, configuration, or condition, let's add them as we
discover them.
2021-11-09 14:05:31 +01:00
Pedro López-Cabanillas
d7dad549bd cmake exported targets
The minimum required cmake version is raised to 3.1,
because the exported targets started at that version.

The build system creates the exported targets:
- The executable uchardet::uchardet
- The library uchardet::libuchardet
- The static library uchardet::libuchardet_static

A downstream project using CMake can find and link the library target
directly with cmake (without needing pkg-config) this way:

~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
  add_executable( sample sample.c )
  target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~

After installing uchardet in a prefix like "$HOME/uchardet/":
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."

Instead of installing, the build directory can be used directly, for
instance:

cmake -Duchardet_DIR="$HOME/uchardet-0.1.0/build/" ...
2021-11-09 09:52:15 +00:00
Aaron Madlon-Kay
6f38ab95f5 Mention MacPorts in readme 2021-01-27 06:57:58 +00:00
Jehan
c8a3572cca Issue #17: update README.
Replace the old link to the science paper by one on archive-mozilla
website. Remove the original source link as I can't find any archived
version of it (even on archive.org, only the folder structure is saved,
not actual files themselves, so it's useless).

Also add some history, which is probably a nice touch.

Add a link to crossroad to help people who'd want to cross-compile
uchardet.

Finally add the R binding by Artem Klevtsov and QtAV as reported.
2020-04-29 16:20:00 +02:00
Jehan
472a906844 Issue #16: "i686" uname not properly detected as x86.
This is basically a continuation of an older bug from Bugzilla:
https://bugs.freedesktop.org/show_bug.cgi?id=101033
2020-04-28 20:43:12 +02:00
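The actual fix lives in the CMake build files and is not shown in this log. As a rough sketch of the idea only, the following Python function classifies a `uname`-style machine string as x86; the pattern and the name `is_x86` are assumptions for illustration, not the project's code.

```python
import re

# Hypothetical pattern: i386 through i686, plus common 64-bit spellings.
X86_PATTERN = re.compile(r"^(i[3-6]86|x86|x86_64|amd64)$")


def is_x86(machine: str) -> bool:
    """Return True if a uname-style machine name denotes an x86 target."""
    return X86_PATTERN.match(machine.strip().lower()) is not None


if __name__ == "__main__":
    for name in ("i686", "x86_64", "AMD64", "aarch64"):
        print(name, is_x86(name))
```

Matching the whole `i[3-6]86` family rather than a single literal such as `i386` is what keeps a value like "i686" from slipping through.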
myd7349
8681fc060e build: Add uchardet CLI tool building support for MSVC 2020-04-26 08:16:14 +00:00
myd7349
5bcbd23acf build: Fix build errors on Windows
- Fix "string no output variable specified" error on UWP

  On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result:
  string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE)
  will be treated as:
  string(TOLOWER TARGET_ARCHITECTURE)
  which, as a result, will cause a CMake error:

  CMake Error at CMakeLists.txt:42 (string):
    string no output variable specified

- Remove unnecessary header inclusions in uchardet.cpp

  These extra inclusions cause build errors on Windows.
2020-04-26 10:08:45 +08:00
Jehan
a49f8ef6ea doc: update README.maintainer.
There is one more step to transform a git tag into a proper "Gitlab
release" with the new platform.
2020-04-23 12:32:49 +02:00
Jehan
59f68dbe57 Release: version 0.0.7 v0.0.7 2020-04-23 11:48:58 +02:00
Jehan
98bc2f31ef Issue #8: have BuildLangModel.py add ending newline to generated source. 2020-04-22 22:57:25 +02:00
Jehan
44a50c30ee Issue #8: no newline at end of file.
Not sure if it is in the C++ standard, or was, but apparently some
compilers may complain when files don't end with a newline (though
neither GCC nor Clang does, as our CI and my local builds are fine).

So here are all our generated sources which didn't have such an ending
newline (hopefully I forgot none). I just loaded them in my vim editor,
and resaved them. This was enough to add an ending newline.
2020-04-22 22:53:25 +02:00
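The same normalization can be done programmatically rather than by resaving files in an editor. This is a minimal sketch, assuming the generated source is held as a string before being written; the function name `with_trailing_newline` is hypothetical and is not taken from `BuildLangModel.py`.

```python
def with_trailing_newline(source: str) -> str:
    """Return source unchanged if empty or newline-terminated, else add one."""
    if not source or source.endswith("\n"):
        return source
    return source + "\n"


if __name__ == "__main__":
    # Hypothetical last line of a generated C++ file, missing its newline.
    generated = "/* end of generated model */"
    print(repr(with_trailing_newline(generated)))
```

Applying the function twice gives the same result as applying it once, so it is safe to run over files that are already correct.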
Jehan
6c7f32a751 Issue #10: Crashing sequence with nsSJISProber.
uchardet_handle_data() should not try to process data of zero length.

Still, it is not technically an error to feed empty data to the engine,
and I can imagine it happening, especially in some automatic process
with random input files (which looks like what was happening in the
reporter's case). So feeding empty data just returns success without
actually doing any processing, which allows the data feed to continue.
2020-04-22 22:11:51 +02:00
Jehan
ef0313046b Also allow uchardet tool to detect encoding of a file named "--".
My previous commit was good except for the very special case of wanting
to analyze a file named "--". This file would be ignored.

With this change, only the first "--" option will be ignored as meaning
"end of option arguments", but any remaining value (another "--"
included) will be considered as a file path.
2020-04-22 21:11:23 +02:00
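The rule described above (only the first "--" ends option parsing, and everything after it is a file path) can be sketched as follows. This is not the tool's C++ code; the function name `split_arguments` and the option used in the example are placeholders.

```python
def split_arguments(argv):
    """Split arguments into (options, paths).

    Only the first "--" is consumed as the end-of-options marker; any
    later argument, including another "--", is treated as a file path.
    """
    options, paths = [], []
    options_ended = False
    for arg in argv:
        if not options_ended and arg == "--":
            options_ended = True  # consume the marker itself
        elif not options_ended and arg.startswith("-") and arg != "-":
            options.append(arg)
        else:
            paths.append(arg)
    return options, paths


if __name__ == "__main__":
    # The second "--" and "-file" are file names, not options.
    print(split_arguments(["--help", "--", "--", "-file"]))
```

With this rule a file literally named "--" can be analyzed by passing it after the end-of-options marker.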
Jehan
4a37dfdf1c Issue #15: support "--" end-of-option. 2020-04-22 21:05:44 +02:00
wangqr
ae7acbd0f2 Add dllexport to interface functions
This allows building the DLL on Windows with other compilers than GNU ones.
See MR !4.
2020-04-22 18:54:07 +00:00
Artem Klevtsov
2694ba6363 Fix global-buffer-overflow due EUCTW_TABLE_SIZE 2020-04-22 17:06:40 +00:00
Jehan
81ab1d1da1 gitlab-ci: Adding a Clang build. 2020-04-22 18:04:56 +02:00
Jehan
6afec53adc gitlab-ci: Windows 32 and 64-bit builds. 2020-04-22 18:00:36 +02:00
Jehan
b5674dbd50 gitlab-ci: first CI build for uchardet.
Very simple CI since uchardet is an extremely low/no dependency library.
So basically we install CMake in Debian/testing and we are good.
2020-04-22 17:22:23 +02:00
Jehan
e0b9269849 Fix various other occurrences of bug tracker URL in code/build. 2020-04-22 12:29:41 +02:00
Jehan
60bf53c81e README: update to Gitlab links.
Freedesktop moved its infrastructure to Gitlab a while ago.
2020-04-22 00:33:48 +02:00
Jehan
0cfb75724a README: some small updates. 2020-04-22 00:17:23 +02:00
Jehan
bdfd6116a9 Add a mention about fd.o code of conduct. 2018-09-26 15:12:25 +02:00
Ilya Tumaykin
f136d434f0 build: turn TARGET_ARCHITECTURE into option
Default value is autodetected if not specified by user.
2018-01-21 15:58:13 +01:00
Jehan
95872ef41c Adding some information about building for Windows. 2017-12-26 03:37:42 +01:00
Jehan
df67ae4fe0 CMake: get rid of some commented code.
It says that's for Win32 platform and uses the install prefix as library
prefix. But that's not at all the same kind of prefixes!
CMAKE_INSTALL_PREFIX expected value is the path to install the lib (what
is called the "installation prefix"), whereas CMAKE_*_LIBRARY_PREFIX are
the prefix on the file name (usually "lib" on UNIX-like systems).
Anyway I don't see a need to change this value. It will be called
"libuchardet.dll" on Win32. I don't see the problem.
Also this code was already commented out, and compilation and usage for
Win32 works just fine without it. :-)
2017-12-24 19:47:05 +01:00
Jehan
cd617d181d CMake: do not check/set SSE and float-store options on non-x86 targets.
Not sure if that's right. I guess we might also find non-x86 machines
where floating point computation does not follow the IEEE standard
either. But let's do this for now to avoid a useless performance hit.
2017-11-07 00:37:54 +01:00
Jehan
939482ab2b CMake: slightly improve the configuration option messages.
Also add full stops, similarly to CMake default options.
2017-11-06 02:11:20 +01:00
Jehan
77bf71ea36 CMake: rename s/ENABLE_SSE2/CHECK_SSE2/.
"ENABLE_SSE2" may be misleading since having it ON does not necessarily
mean that SSE2 flags will be actually set. It only means that the
support will be checked (then set only when supported).
Also adding the warning about possible performance decrease.
2017-11-06 02:07:40 +01:00
Jehan
5996bbd995 Bug 101033 - Testsuite fails on i386.
Floating point accuracy may be different depending on the architecture.
In particular some architectures may store floating values with
different precision, resulting in unreliable results across various
machines. This seems to be particularly true on older x86 machines
without SSE support, which were the reported cases.
The proposed solution is to test for SSE support and explicitly add the
proper flags (even though they are set by default anyway on modern x86).
When this is not available (on older machines or simply when not on x86
processors), I replace sse2 flags with -ffloat-store, which forces IEEE
floating point definition.
The reason not to always force -ffloat-store is that it seems to
decrease performance on some machines. SSE is preferred if available.

I also add an ENABLE_SSE2 option in the CMake file to allow builders to
use -ffloat-store even though SSE2 may be available on the build
machine. This allows building portable binaries which can also be
installed on older machines.
2017-11-06 01:56:45 +01:00
Jehan
056a5a6e51 README: add some applications having uchardet as dependency.
There are likely more (and I know some are planning support) but these
are the ones I know of and with support already in.
2017-09-21 00:06:03 +02:00
Jehan
1898847eb6 src: cast value to its proper type.
Thanks to Marino Faggiana for reporting it.
See: https://github.com/BYVoid/uchardet/issues/37
2017-08-27 13:01:30 +02:00
Jehan
170ef349cf src: fix some doc comments. s/a instance/an instance/.
Unless I am mistaken, we should use "an" when the next word starts with
a vowel.
2017-08-19 10:46:25 +02:00
Jehan
c049332c41 src: s/detctor/detector/. 2017-08-18 12:03:54 +02:00
Jehan
d9d014742a README: Gentoo also has a uchardet package.
And it is up-to-date with upstream URL at Freedesktop! Good!
2017-05-28 21:13:59 +02:00