50 Commits

Author SHA1 Message Date
Jehan
a251753db8 LangModels: updating Hungarian language models. 2015-12-12 18:06:17 +01:00
Jehan
4c8316f9cf Nearly-ASCII text with NBSP is still not ASCII.
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in original code (according to
code comments), probably tied to the specificity of the original program from
Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only
exception" (note that I could have returned any ISO-8859 charsets since
they all have this character in common).
2015-12-05 21:11:29 +01:00
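The rule described in this commit message can be sketched roughly as below. `label_nearly_ascii` is a hypothetical helper written for illustration, not a function from uchardet, and it ignores everything else the real detector does (escape sequences, multi-byte probers, language models).

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of uchardet.
 * Returns "ASCII" for pure 7-bit input, "ISO-8859-1" when the only
 * non-ASCII byte present is NBSP (0xA0), and NULL when another high
 * byte shows that real charset detection is needed. */
const char *label_nearly_ascii(const unsigned char *buf, size_t len)
{
    int seen_nbsp = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0xA0)
            seen_nbsp = 1;      /* NBSP is not ASCII */
        else if (buf[i] >= 0x80)
            return NULL;        /* some other high byte */
    }
    return seen_nbsp ? "ISO-8859-1" : "ASCII";
}
```

With this sketch, `"abc"` is labelled `ASCII`, while the same text with a single 0xA0 byte in it is labelled `ISO-8859-1`, which iconv can then re-encode without error.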
Jehan
e5234d6b61 Stating endianness of UTF-16 and UTF-32 was an error when BOM present.
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
It also broke unit tests in programs using uchardet, since a conversion from UTF-8
to UTF-16LE/BE would create a text without BOM, and a conversion from
UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed
existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
2015-12-04 19:19:39 +01:00
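A minimal sketch of the labelling rule from this commit message: when a BOM is present, report the endianness-neutral name. `label_from_bom` is a hypothetical name, and this is not the detector's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of uchardet.
 * Returns "UTF-16" or "UTF-32" when the buffer starts with the matching
 * BOM, without any "LE"/"BE" suffix, and NULL when no such BOM is found. */
const char *label_from_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32_be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32_le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf16_be[] = {0xFE, 0xFF};
    static const unsigned char utf16_le[] = {0xFF, 0xFE};

    /* UTF-32 is tested first: its little-endian BOM (FF FE 00 00)
     * begins with the UTF-16 little-endian BOM (FF FE). */
    if (len >= 4 && (memcmp(buf, utf32_be, 4) == 0 ||
                     memcmp(buf, utf32_le, 4) == 0))
        return "UTF-32";
    if (len >= 2 && (memcmp(buf, utf16_be, 2) == 0 ||
                     memcmp(buf, utf16_le, 2) == 0))
        return "UTF-16";
    return NULL;
}
```

A converter given the label `UTF-16` reads the BOM itself to pick the byte order, so the text round-trips unchanged.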
Jehan
5691dc59a1 LangModels: rename Cyrillic models to Russian models.
Our language models are per-lang, not per script.
2015-12-04 03:27:29 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is identical to TIS-620 except for the added
non-breaking space character.
In practice our detection will return TIS-620 unless the text
contains a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
55b4f23971 Single Byte charsets: high ctrl character ratio lowers confidence.
Control characters are not an error per se. Nevertheless they are clearly not
frequent in single-byte charset texts. It is only normal for them to lower
confidence in a charset. In particular a higher ctrl-per-letter ratio means
a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
2015-12-04 00:04:43 +01:00
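The commit message does not give the exact formula, so the following is only one plausible way to make control characters lower a confidence value. The function name and the linear scaling are assumptions made for illustration.

```c
/* Hypothetical sketch, NOT uchardet's formula.
 * Scales a confidence value by the share of characters that are not
 * control characters, so more control characters mean less confidence. */
double penalize_ctrl(double confidence, unsigned total_chars,
                     unsigned ctrl_chars)
{
    if (total_chars == 0 || ctrl_chars > total_chars)
        return confidence;      /* nothing sensible to scale by */
    return confidence * (double)(total_chars - ctrl_chars)
                      / (double)total_chars;
}
```

Under this assumption a text with no control characters keeps its confidence, and a charset whose charmap reads many bytes as control characters falls behind a close competitor that reads them as printable.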
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
ea34e8b1bd Update doc comment.
We no longer return an empty string for ASCII: an empty string now
means detection failure only. ASCII gets a proper "ASCII" return value.
2015-12-03 20:36:09 +01:00
Jehan
ba56d91808 Update uchardet URL in various places. 2015-12-03 19:48:29 +01:00
Jehan
d1bc09e4d7 Update authors.
I think I deserved being listed in the authors by now. ;-)
2015-12-03 19:44:13 +01:00
Jehan
c4fa728e7a Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master
Let's shortcut Single Byte charset detection on invalid codepoints.
Merging and fixing the contributor's commit conflicts after code
redesign: in particular we added an illegal character concept (illegal
characters were mixed with control characters in the current charmaps, yet
ctrl characters are NOT to be considered invalid) and constants instead of hardcoded
numbers ('ILL' rather than 255).
2015-12-03 19:26:19 +01:00
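The distinction between illegal and control characters can be illustrated with a toy charmap. Only `ILL` = 255 comes from the commit message; the value of `CTR`, the function names, and the choice of illegal bytes (the five bytes that Windows-1252 leaves unassigned) are assumptions for this sketch.

```c
#include <stddef.h>

enum {
    ILL = 255,  /* illegal codepoint (value stated in the commit message) */
    CTR = 254   /* control character (value assumed for this sketch) */
};

/* Toy charmap lookup, NOT one of uchardet's generated tables. */
static unsigned char toy_order(unsigned char b)
{
    if (b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D)
        return ILL;
    if (b < 0x20 || b == 0x7F)
        return CTR;
    return 0;   /* placeholder order for every other byte */
}

/* Returns 0 as soon as an illegal byte rules the charset out,
 * 1 otherwise. Control characters alone never rule it out. */
int toy_charset_possible(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (toy_order(buf[i]) == ILL)
            return 0;   /* shortcut: stop probing this charset */
    return 1;
}
```

Named constants make the early exit readable: the loop tests for `ILL` only, so a stray control byte still leaves the charset as a candidate.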
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. 2015-12-03 19:04:07 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
4f1c3ff85e nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.
If all sequences in a text are positive sequences, the ratio of positive
sequences cannot make the difference between 2 very close charsets.
A ratio of positive sequences per letter, on the other hand, can
break a tie between 2 encodings. If, while adding a letter, the number
of positive sequences does not increase, the confidence will decrease
(corresponding to the fact that it was likely not a letter).
On the other hand, if the number of positive sequences increases, so
will the confidence.
For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
2015-11-30 19:52:07 +01:00
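The tie-breaking effect described above can be shown with a simplified sketch. The function name is hypothetical, and the model's typical positive ratio and any other factors used by the real prober are deliberately left out.

```c
/* Simplified sketch, NOT the actual nsSBCharSetProber computation.
 * The first factor is the share of sequences that are positive; the
 * second is the number of positive sequences per character. */
double seq_confidence(unsigned positive_seqs, unsigned total_seqs,
                      unsigned total_chars)
{
    if (total_seqs == 0 || total_chars == 0)
        return 0.0;

    double r = (double)positive_seqs / total_seqs;  /* ties at 1.0 */
    r *= (double)positive_seqs / total_chars;       /* breaks the tie */
    return r;
}
```

Take a 10-character text. In charset A every byte is a letter, giving 9 sequences, all positive. In charset B one byte is not a letter, leaving 7 sequences, all positive. The first factor is 1.0 for both, but the second favours A (9/10 against 7/10).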
Jehan
9cb5764b73 LangModels: update the French language models.
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
0289c2a232 Differentiate ASCII and detection failure.
The lib used to return "" for both properly detected ASCII and
detection failure. And the tool would return "ascii/unknown".
Make a proper distinction between the 2 cases.
2015-11-28 17:04:52 +01:00
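For a caller, the distinction can be handled as sketched below. `classify_result` and the `detect_kind` enum are hypothetical names; the only facts taken from the log are that an empty string means detection failure and that ASCII is reported as "ASCII".

```c
#include <stddef.h>
#include <string.h>

typedef enum { DETECT_FAILED, DETECT_ASCII, DETECT_OTHER } detect_kind;

/* Hypothetical caller-side helper, NOT part of uchardet.
 * `charset` is a name as returned by uchardet_get_charset(). */
detect_kind classify_result(const char *charset)
{
    /* NULL is treated as failure too, purely as a defensive measure. */
    if (charset == NULL || charset[0] == '\0')
        return DETECT_FAILED;
    if (strcmp(charset, "ASCII") == 0)
        return DETECT_ASCII;
    return DETECT_OTHER;
}
```

A tool built on this can print "unknown" for `DETECT_FAILED` and the charset name otherwise, instead of the former combined "ascii/unknown".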
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
2106173546 Move all Single-Byte language models to a subdirectory. 2015-11-27 23:11:23 +01:00
Jehan
984d8f7b09 Add language information in model names when they were missing.
Models are language specific (there could be several models for the same
charset but different languages). Let's have a clear naming scheme.
2015-11-27 18:21:13 +01:00
Jehan
42b91898da Create 3-letter constants for special charmap characters.
Control characters, carriage, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
2015-11-27 17:41:54 +01:00
Ophir LOJKINE
5ef60164fc Stop detection early on control characters 2015-11-24 22:07:41 +03:00
Jehan
e8dd55995a Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info...
... and add UTF-32 BOM detection.
2015-11-24 18:50:23 +01:00
Jehan
9a74d08b3c Fix minor space issues. 2015-11-24 00:15:44 +01:00
Jehan
35153b1e50 Fixes boolean operation precedence warnings...
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
2015-11-18 19:38:12 +01:00
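As an illustration of this warning (the function below is a made-up example, not code from the commit): `&&` binds tighter than `||`, so explicit parentheses do not change the result, but they make the grouping visible and silence `-Wparentheses`.

```c
/* Made-up example. Written without the inner parentheses, the
 * expression would parse the same way, but GCC would emit
 * "suggest parentheses around '&&' within '||'". */
int is_word_byte(unsigned char c)
{
    return c == '_' || (c >= 'a' && c <= 'z');
}
```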
Jehan
9d9257072a s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming. 2015-11-18 03:21:34 +01:00
Jehan
41f3b757f1 Some more encoding names changed to be iconv-compatible.
I forgot to fix some names.
In particular "x-mac-cyrillic" is not valid in iconv, and has been
changed to "MAC-CYRILLIC".
2015-11-17 18:51:45 +01:00
Jehan
ad4dfc4be4 Add a BUILD_STATIC CMake option to optionally build a static library.
It is still ON by default, which means both shared and static libs will
be built and installed (current behavior), but it makes it possible to
disable the build of a static lib.
Closes https://github.com/BYVoid/uchardet/issues/1.
2015-11-17 18:14:51 +01:00
Jehan
dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear if our naming followed any kind of rules. In particular,
iconv is a widely used encoding conversion API. We will follow its
naming.
At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to follow naming from `iconv --list`
though iconv is mostly case-insensitive so it should not have been a
problem. "Just in case".
Prober names can still have free naming (only used for output display
apparently).
Finally HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding in libiconv master code with this name. So I will
consider it valid.
2015-11-17 16:15:21 +01:00
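The renaming can be pictured as a small mapping. `to_iconv_name` is a hypothetical helper, not how the library stores its names; the two explicit renames and the uppercasing are the ones mentioned in this log.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of uchardet.
 * Writes an iconv-style name for `name` into `out` (truncating if the
 * buffer is too small) and returns `out`, or NULL if out_size is 0. */
const char *to_iconv_name(const char *name, char *out, size_t out_size)
{
    size_t i = 0;

    if (out_size == 0)
        return NULL;

    /* Names that iconv does not accept get an explicit replacement. */
    if (strcmp(name, "x-euc-tw") == 0)
        name = "EUC-TW";
    else if (strcmp(name, "x-mac-cyrillic") == 0)
        name = "MAC-CYRILLIC";

    /* Everything else is uppercased to match `iconv --list`. */
    for (; name[i] != '\0' && i + 1 < out_size; i++)
        out[i] = (char)toupper((unsigned char)name[i]);
    out[i] = '\0';
    return out;
}
```

For example, `windows-1255` becomes `WINDOWS-1255` and `x-euc-tw` becomes `EUC-TW`.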
Jehan
256d1957b2 uchardet_get_charset() should never return NULL...
... to stay backward compatible with previous behavior.
About detection failure, our in-code documentation says:
"@return name of charset on success and "" on failure or pure ascii."
This behavior had been broken by commit 3a518c0, which returned NULL
instead. Our command-line tool was the first victim, segfaulting on
ASCII files.
2015-11-16 17:33:16 +01:00
Carbo Kuo
016eb18437 Merge pull request #15 from wang-bin/c++abi
do not use std::string which breaks c++ abi
2015-11-09 20:04:21 +01:00
wang-bin
3a518c0536 do not use std::string which breaks c++ abi
Some STL types can break the ABI. If the program is built with g++ 5 and the libstdc++ on the target platform comes from g++ 4.x, the program cannot run.
2015-11-04 18:16:24 +08:00
Jehan
ba97505efc (void) and () empty arguments are different in C.
This fixes the following warning when including uchardet.h in C source,
built with -Wstrict-prototypes:
`uchardet.h:52:1: warning: function declaration isn't a prototype`
2015-09-05 15:58:56 +02:00
wm4
d59294a00e Header conformance fixes
Identifiers starting with __ are reserved for the system - user code
(including non-system libraries) must not define them.

A function which takes no parameters is declared with "(void)". In C, an
empty parameter list means that any number of parameters with
unspecified types is allowed, which is not what we want in this case.
Another reason to fix this is that compilers often warn if this legacy
feature is used, which is bothersome for API users.

Additionally, use an opaque struct as underlying type for uchardet_t.
This facilitates type-checking, as it's harder to confuse with other
types, especially in C. This is not strictly a conformance issue, but
still a nice change. Note that this is neither an API nor an ABI change.
2015-08-05 22:24:49 +02:00
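Both points from this commit message, `(void)` parameter lists and an opaque struct behind the handle type, can be sketched as follows. All `demo_` names are hypothetical and do not match the declarations in uchardet.h.

```c
#include <stdlib.h>

/* --- what a public header would contain --- */

/* Opaque handle: callers only see a pointer to an incomplete type,
 * so it cannot be silently mixed up with other pointer types. */
typedef struct demo_detector *demo_detector_t;

/* "(void)" declares a real prototype taking no arguments. With "()",
 * C compilers before C23 accept any arguments, and
 * -Wstrict-prototypes warns about the declaration. */
demo_detector_t demo_new(void);
void demo_delete(demo_detector_t d);
int demo_bytes_seen(demo_detector_t d);

/* --- what the implementation file would contain --- */

struct demo_detector {
    int bytes_seen;
};

demo_detector_t demo_new(void)
{
    /* calloc zero-initialises the structure; NULL on failure. */
    return calloc(1, sizeof(struct demo_detector));
}

void demo_delete(demo_detector_t d)
{
    free(d);
}

int demo_bytes_seen(demo_detector_t d)
{
    return d->bytes_seen;
}
```

Because the handle is still a pointer, swapping the underlying type for a named struct leaves the calling code and the binary interface unchanged, which matches the note that this is neither an API nor an ABI change.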
Loic Le Loarer
07af96b3a7 Use perror for error report 2015-07-16 01:20:03 +02:00
Loic Le Loarer
1c89a2f8ff Use stdin by default as before 2015-07-16 01:15:08 +02:00
Loic Le Loarer
972d061e90 Allow multiple filenames in the command line 2015-07-16 00:59:58 +02:00
nu774
f5637b23b8 fix for MinGW build 2015-06-20 12:28:01 +09:00
nu774
ba6679f2b3 fix: export symbols were not passed to the linker as intended 2015-06-20 12:28:01 +09:00
BYVoid
06e65096f1 Add comments on uchardet.h 2011-07-11 15:25:31 +08:00
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
331af64156 Add command line interface. 2011-07-10 16:42:38 +08:00
BYVoid
1b05009d4d Update contributors information. 2011-07-10 15:43:28 +08:00
BYVoid
e948063c0e Refine uchardet.h 2011-07-10 15:41:24 +08:00
BYVoid
1094508286 Dos2unix. 2011-07-10 15:20:41 +08:00
BYVoid
9be8afdfb9 Complete comments on interface. 2011-07-10 15:20:05 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00