Pretty basic: you can weight a preferred language and this will impact
the result. Say the algorithm "hesitates" between encoding E1 in
language L1 and encoding E2 in language L2. By giving L2 a weight of
1.1, for instance because it is the OS language or the user's usual
preferred language, you may help the algorithm settle very tight cases.
It can also be helpful when you already know the language of a document
for sure and just don't know its encoding. Then you may set a very high
weight for this language, or simply set a default weight of 0 and a
weight of 1 for this language. Only relevant encodings will be taken
into account.
This is still limited though, as generic encodings are still implemented
in a language-agnostic way. UTF-8 for instance would be disadvantaged by
this weight system until we make it language-aware.
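To make the weighting idea concrete, here is a minimal sketch, not the
actual implementation. All names (demo_candidate, demo_weight,
demo_pick_best) are hypothetical, and it assumes that each candidate
carries a language and that its score is simply its confidence
multiplied by the weight of that language. Language-agnostic probers
are deliberately left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate: one (encoding, language) guess.
 * Assumption: language is never NULL in this sketch. */
typedef struct {
    const char *encoding;
    const char *language;   /* e.g. an ISO 639-1 code */
    float       confidence; /* raw confidence, before weighting */
} demo_candidate;

/* Hypothetical weight set by the caller for one language. */
typedef struct {
    const char *language;
    float       weight;
} demo_weight;

/* Return the weight set for a language, or default_weight if none. */
static float demo_weight_for(const char *language,
                             const demo_weight *weights, size_t n_weights,
                             float default_weight)
{
    for (size_t i = 0; i < n_weights; i++)
        if (strcmp(weights[i].language, language) == 0)
            return weights[i].weight;
    return default_weight;
}

/* Return the index of the candidate with the highest weighted score,
 * or -1 when no candidate has a score above zero. */
static int demo_pick_best(const demo_candidate *cands, size_t n_cands,
                          const demo_weight *weights, size_t n_weights,
                          float default_weight)
{
    int   best = -1;
    float best_score = 0.0f;

    for (size_t i = 0; i < n_cands; i++) {
        float score = cands[i].confidence *
                      demo_weight_for(cands[i].language, weights,
                                      n_weights, default_weight);
        if (score > best_score) {
            best_score = score;
            best = (int) i;
        }
    }
    return best;
}
```

With two tight candidates, a 1.1 weight on the second language is
enough to flip the choice, and a default weight of 0 with a weight of 1
on a single language discards every candidate in another language.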
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.
Right now, it returns the 2-character language code from ISO 639-1. A
project using the library could easily get the English language name
from the XML/JSON files provided by the iso-codes project. That project
also makes it easy to localize the language name in other languages
through gettext (this is what we do in GIMP for instance). I don't add
any dependency though and leave it to downstream projects to implement
this.
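As an illustration of what a downstream project might do with the
returned code, here is a small sketch. The hard-coded table is only a
hypothetical stand-in for data that would really be loaded from
iso-codes, and demo_language_name() is not part of any library. It also
handles the NULL case mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for names loaded from the iso-codes data. */
static const struct {
    const char *code;
    const char *name;
} demo_languages[] = {
    { "en", "English"  },
    { "fr", "French"   },
    { "ja", "Japanese" },
    { "zh", "Chinese"  },
};

/* Map an ISO 639-1 code to an English name.
 * NULL (no language detected) maps to "unknown"; a code missing from
 * the table is returned unchanged. */
static const char *demo_language_name(const char *code)
{
    size_t n = sizeof demo_languages / sizeof demo_languages[0];

    if (code == NULL)
        return "unknown";
    for (size_t i = 0; i < n; i++)
        if (strcmp(demo_languages[i].code, code) == 0)
            return demo_languages[i].name;
    return code;
}
```

A real project would pass the looked-up name through gettext to get a
localized string.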
I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()
Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0: like
uchardet_get_charset(), it returns "" when no candidate was found.
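The relationship between the candidate-based accessors and the old
single-result call can be sketched as follows. This is a self-contained
model with hypothetical demo_* names, not the library code. It assumes
candidates are stored best guess first and that out-of-range indices
yield "" and a confidence of 0.

```c
#include <stddef.h>

/* Hypothetical model of one detection guess. */
typedef struct {
    const char *encoding;
    float       confidence;
} demo_guess;

/* Hypothetical model of a finished detection, best guess first. */
typedef struct {
    const demo_guess *guesses;
    size_t            n_guesses;
} demo_result;

/* Number of candidates found (possibly 0). */
static size_t demo_count(const demo_result *res)
{
    return res->n_guesses;
}

/* Encoding name of a candidate, or "" if there is no such candidate. */
static const char *demo_get_encoding(const demo_result *res,
                                     size_t candidate)
{
    return candidate < res->n_guesses ? res->guesses[candidate].encoding
                                      : "";
}

/* Confidence of a candidate, or 0 if there is no such candidate. */
static float demo_get_confidence(const demo_result *res, size_t candidate)
{
    return candidate < res->n_guesses
               ? res->guesses[candidate].confidence
               : 0.0f;
}

/* Old-style accessor: just the encoding of candidate 0. */
static const char *demo_get_charset(const demo_result *res)
{
    return demo_get_encoding(res, 0);
}
```

A caller that wants more than the best guess loops from 0 to the
candidate count and reads the encoding and confidence of each entry.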
It was not clear if our naming followed any kind of rules. iconv is a
widely used encoding conversion API, so we will follow its naming.
At least one returned name was found to be invalid: x-euc-tw instead of
EUC-TW. Other names have been uppercased to follow the naming from
`iconv --list`, though iconv is mostly case-insensitive so it should not
have been a problem. "Just in case".
Prober names can still be chosen freely (they are apparently only used
for output display).
Finally HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding in libiconv master code with this name. So I will
consider it valid.
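To show why only x-euc-tw was a real problem while lowercase names were
merely cosmetic, here is a small sketch of ASCII case-insensitive name
handling. The helpers are hypothetical and do not reproduce how iconv
matches names internally.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if both names are equal when ASCII case is ignored. */
static int demo_names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char) *a) != toupper((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same time. */
    return *a == *b;
}

/* Copy an uppercased version of in to out, truncating if needed.
 * out is always NUL-terminated when out_size > 0. */
static void demo_uppercase(const char *in, char *out, size_t out_size)
{
    size_t i = 0;

    if (out_size == 0)
        return;
    for (; in[i] != '\0' && i + 1 < out_size; i++)
        out[i] = (char) toupper((unsigned char) in[i]);
    out[i] = '\0';
}
```

Under such a comparison "euc-tw" matches "EUC-TW", but "x-euc-tw" does
not, since the difference is in spelling and not in case.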
Identifiers starting with __ are reserved for the system - user code
(including non-system libraries) must not define them.
A function which takes no parameters is declared with "(void)". In C, an
empty parameter list means that any number of parameters with
unspecified types is allowed, which is not what we want in this case.
Another reason to fix this is that compilers often warn if this legacy
feature is used, which is bothersome for API users.
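For illustration, here is what the "(void)" form looks like on a
made-up function (demo_answer is hypothetical, not part of the API):

```c
/* Prototype: the function takes no arguments, so a call such as
 * demo_answer(1) is rejected at compile time.
 * Before C23, writing "int demo_answer();" instead would declare a
 * function with unspecified parameters, and such a call would not be
 * checked against the declaration. */
int demo_answer(void);

int demo_answer(void)
{
    return 42;
}
```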
Additionally, use an opaque struct as underlying type for uchardet_t.
This facilitates type-checking, as it's harder to confuse with other
types, especially in C. This is not strictly a conformance issue, but
still a nice change. Note that this is neither an API nor an ABI change.
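A sketch of the opaque handle pattern, with hypothetical demo_* names
and a made-up struct body. The void * typedef is only there for
contrast and is an assumption about the kind of loosely typed handle
this pattern improves on.

```c
#include <stdlib.h>

/* Opaque handle: in a public header, users only see a pointer to an
 * incomplete struct type and cannot look inside it. */
typedef struct demo_detector *demo_detector_t;

/* Loosely typed alternative, for contrast: in C any object pointer
 * converts to void * silently, so mix-ups go unnoticed. */
typedef void *demo_untyped_t;

/* The struct body would live in the implementation file only. */
struct demo_detector {
    int n_bytes_seen;
};

/* Create a detector; returns NULL on allocation failure. */
demo_detector_t demo_detector_new(void)
{
    return calloc(1, sizeof(struct demo_detector));
}

/* Record that n_bytes more bytes were fed to the detector. */
void demo_detector_feed(demo_detector_t det, int n_bytes)
{
    det->n_bytes_seen += n_bytes;
}

/* Total number of bytes fed so far. */
int demo_detector_seen(demo_detector_t det)
{
    return det->n_bytes_seen;
}

/* Destroy a detector; passing NULL is allowed. */
void demo_detector_delete(demo_detector_t det)
{
    free(det);
}
```

With the struct pointer type, passing for instance a char * where a
demo_detector_t is expected gets diagnosed by the compiler, which a
void * handle would accept silently.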