This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.
Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.
I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()
Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
Preparing for an updated API which will also allow to loop at the
confidence value, as well as get the list of possible candidate (i.e.
all detected encoding which had a confidence value high enough so that
we would even consider them).
It is still only internal logics though.
Since it's a new feature, we may as well write about it, even though I
would personally not recommend this in favor of more standard and
generic pkg-config (which is not dependent on which build system we are
using ourselves).
Newly added IBM865 charset (for Norwegian) can also be used for Danish
By the way, I fixed `script/charsets/ibm865.py` as Danish uses the 'da'
ISO 639-1 code by the way, not 'dk' (which is sometimes used for other
codes for Denmark, such as ISO 3166 country code and internet TLD) but
not for the language itself.
For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end, i.e. that if there
are enough actual Maltese words, the stats should still weigh in favor
of Maltese likeness (which they mostly did anyway), but since I am
making some other changes, this was just not enough. In particular I was
changing some of the UTF-8 confidence logics and the file ended up
detected as UTF-8 (even though it has illegal sequence and cannot be!
Cf. #9).
So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have definite
non-questionable Maltese in there to simplify testing at this early
stage of uchardet rewriting.
Though it is highly encouraged to do out-of-source builds, it is not
strictly forbidden to do in-source builds. So we should ignore the files
generated by CMake. Only tested with a Linux build, with both make and
ninja backends.
I added .dll and .dylib versions (for Windows and macOS respectively),
guessing these will be the file names on these platforms, unless
mistaken (since untested).
As discussed in !10, let's add with this commit files generated by the
build system, but not any personal environment files (specific to
contributors' environment).
If I missed any file name which can be generated by the build system in
some platforms, configuration, or condition, let's add them as we
discover them.
The minimum required cmake version is raised to 3.1,
because the exported targets started at that version.
The build system creates the exported targets:
- The executable uchardet::uchardet
- The library uchardet::libuchardet
- The static library uchardet::libuchardet_static
A downstream project using CMake can find and link the library target
directly with cmake (without needing pkg-config) this way:
~~~
project(sample LANGUAGES C)
find_package ( uchardet )
if (uchardet_FOUND)
add_executable( sample sample.c )
target_link_libraries ( sample PRIVATE uchardet::libuchardet )
endif ()
~~~
After installing uchardet in a prefix like "$HOME/uchardet/":
cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..."
Instead installing, the build directory can be used directly, for
instance:
cmake -Duchardet_DIR="$HOME/uchardet-0.1.0/build/" ...
Replace the old link to the science paper by one on archive-mozilla
website. Remove the original source link as I can't find any archived
version of it (even on archive.org, only the folder structure is saved,
not actual files themselves, so it's useless).
Also add some history, which is probably a nice touch.
Add a link to crossroad to help people who'd want to cross-compile
uchardet.
Finally add the R binding by Artem Klevtsov and QtAV as reported.
- Fix string no output variables on UWP
On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result:
string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE)
will be treated as:
string(TOLOWER TARGET_ARCHITECTURE)
which, as a result, will cause a CMake error:
CMake Error at CMakeLists.txt:42 (string):
string no output variable specified
- Remove unnecessary header inclusions in uchardet.cpp
These extra inclusions cause build errors on Windows.
Not sure if it is in the C++ standard, or was, but apparently some
compilers may complain when files don't end with a newline (though
neither GCC nor Clang as our CI and my local builds are fine).
So here are all our generated source which didn't have such ending
newline (hopefully I forgot none). I just loaded them in my vim editor,
and resaved them. This was enough to add an ending newline.
uchardet_handle_data() should not try to process data of nul length.
Still this is not technically an error to feed empty data to the engine,
and I could imagine it could happen especially when done in some
automatic process with random input files (which looks like what was
happening in the reporter case). So feeding empty data just returns a
success without actually doing any processing, allowing to continue the
data feed.
My previous commit was good except for the very special case of wanting
to analyze a file named "--". This file would be ignored.
With this change, only the first "--" option will be ignored as meaning
"end of option arguments", but any remaining value (another "--"
included) will be considered as a file path.
It says that's for Win32 platform and uses the install prefix as library
prefix. But that's not at all the same kind of prefixes!
CMAKE_INSTALL_PREFIX expected value is the path to install the lib (what
is called the "installation prefix"), whereas CMAKE_*_LIBRARY_PREFIX are
the prefix on the file name (usually "lib" on UNIX-like systems).
Anyway I don't see a need to change this value. It will be called
"libuchardet.dll" on Win32. I don't see the problem.
Also this code was already commented out, and compilation and usage for
Win32 works just fine without it. :-)