10 Commits

Author SHA1 Message Date
Jehan
7f99b91388 src: new weight concept in the C API.
Pretty basic, you can weight prefered language and this will impact the
result. Say the algorithm "hesitates" between encoding E1 in language L1
and encoding E2 in language L2. By setting L2 with a 1.1 weight, for
instance because this is the OS language, or usual prefered language,
you may help the algorithm to overcome very tight cases.

It can also be helpful when you already know for sure the language of a
document, you just don't know its encoding. Then you may set a very high
value for this language, or simply set a default value of 0, and set 1
for this language. Only relevant encoding will be taken into account.

This is still limited though as generic encoding are still implemented
language-agnostic. UTF-8 for instance would be disadvantaged by this
weight system until we make it language-aware.
2021-03-14 00:12:30 +01:00
Jehan
911695f682 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2021-03-14 00:12:30 +01:00
Jehan
4da22cca97 src: new API to get all candidates and their confidence.
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()

Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
2021-03-14 00:12:30 +01:00
Ilya Tumaykin
6db8b6f8fe
cmake: minor comment cleanups 2016-03-22 01:23:06 +03:00
Ilya Tumaykin
1a1f4bfbd8
cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity 2016-03-22 01:23:03 +03:00
Ilya Tumaykin
b44be77be6
cmake: uniform indent everywhere
Indent with tabs, remove leading/trailing blank lines and spaces.
2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)
78b55ec9fe CMake: Fix regression in f53cb8c building in paths with spaces
Tested with Ninja and Make in Windows and Archlinux with paths
with and without spaces.
2016-03-18 03:37:12 +00:00
Ricardo Constantino (:RiCON)
f53cb8cddd CMake: fix linking with Ninja 2016-03-16 14:17:47 +00:00
nu774
ba6679f2b3 fix: export symbols were not passed to the linker as intended 2015-06-20 12:28:01 +09:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00