16 Commits

Author SHA1 Message Date
Jehan
7f99b91388 src: new weight concept in the C API.
Pretty basic, you can weight prefered language and this will impact the
result. Say the algorithm "hesitates" between encoding E1 in language L1
and encoding E2 in language L2. By setting L2 with a 1.1 weight, for
instance because this is the OS language, or usual prefered language,
you may help the algorithm to overcome very tight cases.

It can also be helpful when you already know for sure the language of a
document, you just don't know its encoding. Then you may set a very high
value for this language, or simply set a default value of 0, and set 1
for this language. Only relevant encoding will be taken into account.

This is still limited though as generic encoding are still implemented
language-agnostic. UTF-8 for instance would be disadvantaged by this
weight system until we make it language-aware.
2021-03-14 00:12:30 +01:00
Jehan
911695f682 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2021-03-14 00:12:30 +01:00
Jehan
4da22cca97 src: new API to get all candidates and their confidence.
Adding:
- uchardet_get_candidates()
- uchardet_get_encoding()
- uchardet_get_confidence()

Also deprecating uchardet_get_charset() to have developers look at the
new API instead. I was unsure if this should really get deprecated as it
makes the basic case simple, but the new API is just as easy anyway. You
can also directly call uchardet_get_encoding() with candidate 0 (same as
uchardet_get_charset(), it would then return "" when no candidate was
found).
2021-03-14 00:12:30 +01:00
wangqr
ae7acbd0f2 Add dllexport to interface functions
This allows building the DLL on Windows with other compilers than GNU ones.
See MR !4.
2020-04-22 18:54:07 +00:00
Jehan
170ef349cf src: fix some doc comments. s/a instance/an instance/.
Unless mistaken, we should use "an" with next word starting with
vowel.
2017-08-19 10:46:25 +02:00
Jehan
c049332c41 src: s/detctor/detector/. 2017-08-18 12:03:54 +02:00
Jehan
a59b1c9571 src: update documentation comments on the public API. 2016-09-21 17:36:17 +02:00
Jehan
ea34e8b1bd Update doc comment.
We do not return empty string on ASCII anymore. It means only detection
failure, now. ASCII will get a proper "ASCII" return.
2015-12-03 20:36:09 +01:00
Jehan
dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear if our naming followed any kind of rules. In particular,
iconv is a widely used encoding conversion API. We will follow its
naming.
At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to follow naming from `iconv --list`
though iconv is mostly case-insensitive so it should not have been a
problem. "Just in case".
Prober names can still have free naming (only used for output display
apparently).
Finally HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding in libiconv master code with this name. So I will
consider it valid.
2015-11-17 16:15:21 +01:00
wm4
d59294a00e Header conformance fixes
Identifiers starting with __ are reserved for the system - user code
(including non-system libraries) must not define them.

A function which takes no parameters is declared with "(void)". In C, an
empty parameter list means that any number of parameters with
unspecified types is allowed, which is not what we want in this case.
Another reason to fix this is that compilers often warn if this legacy
feature is used, which is bothersome for API users.

Additionally, use an opaque struct as underlying type for uchardet_t.
This facilitates type-checking, as it's harder to confuse with other
types, especially in C. This is not strictly a conformance issue, but
still a nice change. Note that this is neither an API or an ABI change.
2015-08-05 22:24:49 +02:00
BYVoid
06e65096f1 Add comments on uchardet.h 2011-07-11 15:25:31 +08:00
BYVoid
1b05009d4d Update contributors information. 2011-07-10 15:43:28 +08:00
BYVoid
e948063c0e Refine ucharder.h 2011-07-10 15:41:24 +08:00
BYVoid
1094508286 Dos2unix. 2011-07-10 15:20:41 +08:00
BYVoid
9be8afdfb9 Compelete comments on intercaface. 2011-07-10 15:20:05 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00