61 Commits

Author SHA1 Message Date
Jehan
ff5fd5eff9 Release: version 0.0.3. v0.0.3 2015-11-19 15:18:11 +01:00
Jehan
5dcff7b241 Hide away tests known to fail.
Some charsets are simply not supported (ex: fr:iso-8859-1), some are
temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or not efficient) from the start, and there is no
need to pollute the `make test` output with them, since the noise could
make us miss actual regressions when they occur. So let's hide these away
for now until we can improve the situation.
2015-11-18 20:02:58 +01:00
Jehan
4b38e68aa2 CMake tests: separate the lang and charset with colon...
... rather than a hyphen. It makes it easier to read.
2015-11-18 19:42:35 +01:00
Jehan
35153b1e50 Fixes boolean operation precedence warnings...
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
2015-11-18 19:38:12 +01:00
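The warning quoted above appears when `&&` and `||` are mixed without parentheses. Below is a minimal sketch of the kind of fix involved. The predicate `is_candidate` and its parameters are hypothetical names chosen for illustration, not code from the uchardet sources.

```c
#include <stdbool.h>

/* Hypothetical predicate, for illustration only.
 * Writing `forced || high_byte && in_range` triggers -Wparentheses in GCC.
 * Because && binds tighter than ||, the parenthesized form below has the
 * same meaning; the parentheses only make that grouping explicit. */
bool is_candidate(bool forced, bool high_byte, bool in_range)
{
    return forced || (high_byte && in_range);
}
```

The behavior is unchanged by such a fix; only the intent becomes visible to the reader and to the compiler.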
Jehan
0d70a36910 Adding some more test files for Russian and Chinese.
Taken from:
https://zh.wikipedia.org/wiki/EUC
https://ru.wikipedia.org/wiki/КОИ-8
And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.
2015-11-18 19:27:38 +01:00
Jehan
eb727d3aca Add automatic testing against every test file. 2015-11-18 18:18:27 +01:00
Jehan
f303a41735 Add Thai test file for UTF-8.
Text from Thai Wikipedia:
https://th.wikipedia.org/wiki/ยูนิโคด
2015-11-18 03:26:34 +01:00
Jehan
9d9257072a s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming. 2015-11-18 03:21:34 +01:00
Jehan
e7c8114233 Add Hebrew test files.
Texts from Hebrew Wikipedia:
https://he.wikipedia.org/wiki/עברית
https://he.wikipedia.org/wiki/ISO_8859
https://he.wikipedia.org/wiki/UTF-8
uchardet fails to detect the ISO-8859-8 files and detects them as
Windows-1255, which is probably acceptable since it is apparently
an "almost compatible superset". It may be worth trying to make
more complete test files in the future to demonstrate the differences.
2015-11-18 03:16:18 +01:00
Jehan
601e59bd83 Add Greek test files.
Taken from Greek Wikipedia:
https://el.wikipedia.org/wiki/UTF-8
https://el.wikipedia.org/wiki/ISO_8859-7
https://el.wikipedia.org/wiki/ISO_8859-7#Windows-1253
Windows-1253 test fails and returns "ISO-8859-7". The two encodings are
actually fairly close for the main letters, except for Ά, which makes
them difficult to differentiate.
2015-11-18 02:57:09 +01:00
Jehan
c8532f63a8 Adding UTF-8 file for Korean.
Text taken from Korean Wikipedia:
https://ko.wikipedia.org/wiki/UTF-8
2015-11-18 02:36:33 +01:00
Jehan
1a58fa6d99 Update AUTHORS. 2015-11-17 21:51:59 +01:00
Jehan
4db0d55692 URL of related project python-chardet has changed. 2015-11-17 21:40:44 +01:00
Jehan
a76c0786b3 Adding test files for main Japanese encodings...
... taken from the following Japanese Wikipedia pages:
https://ja.wikipedia.org/wiki/Extended_Unix_Code
https://ja.wikipedia.org/wiki/ISO/IEC_2022
https://ja.wikipedia.org/wiki/UTF-8
2015-11-17 21:24:47 +01:00
Jehan
0efcdfa546 Reorganize test files in language subdirectories.
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account, and we should create several test files
in different languages for encodings that are used by various
languages.
2015-11-17 21:12:39 +01:00
Jehan
192b0e7d51 Add test files for ISO-8859-[12].
Taken from French page about ISO-8859-1:
https://fr.wikipedia.org/wiki/ISO_8859-1
... and Hungarian Wikipedia page about ISO-8859-2:
https://hu.wikipedia.org/wiki/ISO/IEC_8859-2
We don't have support for ISO-8859-1, and both these files are detected
as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since
Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is
disabled because the ISO-8859-1 file would be detected as ISO-8859-2,
which would in turn be a clear error.
2015-11-17 19:39:58 +01:00
Jehan
3f3f4b8011 Add an ISO-8859-5 test file.
Text taken from Russian Wikipedia page about ISO-8859-5:
https://ru.wikipedia.org/wiki/ISO_8859-5
2015-11-17 19:11:59 +01:00
Jehan
bafccfcea8 Add Windows-1251 test files.
Texts taken from Bulgarian Wikipedia page about Windows-1251:
https://bg.wikipedia.org/wiki/Windows-1251
... and Russian Wikipedia page about Windows-1251:
https://ru.wikipedia.org/wiki/Windows-1251
The Bulgarian file detection is right, but the Russian detection
returns "MAC-CYRILLIC", which is an error and should be fixed.
2015-11-17 19:09:37 +01:00
Jehan
41f3b757f1 Some more encoding names changed to be iconv-compatible.
I forgot to fix some names.
In particular "x-mac-cyrillic" is not valid in iconv, and has been
changed to "MAC-CYRILLIC".
2015-11-17 18:51:45 +01:00
Jehan
8216f7b395 Add an ISO-2022-KR test file.
Text taken from Korean Wikipedia page about the ISO-2022-KR:
https://ko.wikipedia.org/wiki/ISO/IEC_2022
2015-11-17 18:23:46 +01:00
Jehan
ad4dfc4be4 Add a BUILD_STATIC CMake option to optionally build a static library.
It is still ON by default, which means both shared and static libs will
be built and installed (current behavior), but it makes it possible to
disable the build of a static lib.
Closes https://github.com/BYVoid/uchardet/issues/1.
2015-11-17 18:14:51 +01:00
Jehan
9172b763d1 Add TIS-620 in README (Thai language) and a test file.
Test text based on Thai Wikipedia page about the TIS-620 encoding:
https://th.wikipedia.org/wiki/TIS-620
2015-11-17 17:39:45 +01:00
Jehan
399c4c4d9e Add libchardet in related projects.
See https://github.com/BYVoid/uchardet/issues/11
for review of differences with uchardet.
2015-11-17 17:12:44 +01:00
Jehan
362e36d1ed Add EUC-KR test file.
Contains text taken from Wikipedia on EUC-KR page in Korean.
https://ko.wikipedia.org/wiki/EUC-KR
I added it as a subtitle-like file because, as the original Mozilla
paper says: "The input text may contain extraneous noises which have no
relation to its encoding, e.g. HTML tags, non-native words".
Therefore I feel it is important to have slightly noisy test files where
possible, in order to test our algorithm's resistance to noise.
2015-11-17 16:36:17 +01:00
Jehan
dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear if our naming followed any kind of rules. In particular,
iconv is a widely used encoding conversion API. We will follow its
naming.
At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to follow naming from `iconv --list`
though iconv is mostly case-insensitive so it should not have been a
problem. "Just in case".
Prober names can still have free naming (only used for output display
apparently).
Finally HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding in libiconv master code with this name. So I will
consider it valid.
2015-11-17 16:15:21 +01:00
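As a rough illustration of the renaming described above, the sketch below maps a legacy name to an iconv-style spelling. The helper `to_iconv_name` is hypothetical and is not how uchardet implements the change; the two table entries are the renames mentioned in this log, and every other name is simply uppercased.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper, not part of uchardet: map a legacy name to the
 * iconv-style spelling. The two table entries come from the commit
 * messages in this log; every other name is simply uppercased. */
const char *to_iconv_name(const char *legacy, char *out, size_t out_size)
{
    static const struct { const char *from; const char *to; } renames[] = {
        { "x-euc-tw",       "EUC-TW"       },
        { "x-mac-cyrillic", "MAC-CYRILLIC" },
    };
    size_t i;

    if (out_size == 0)
        return "";

    for (i = 0; i < sizeof renames / sizeof renames[0]; i++) {
        if (strcmp(legacy, renames[i].from) == 0) {
            strncpy(out, renames[i].to, out_size - 1);
            out[out_size - 1] = '\0';
            return out;
        }
    }

    /* Fallback: copy and uppercase, truncating to fit the buffer. */
    for (i = 0; i + 1 < out_size && legacy[i] != '\0'; i++)
        out[i] = (char) toupper((unsigned char) legacy[i]);
    out[i] = '\0';
    return out;
}
```

A caller would pass its own buffer, for example `char buf[32]; to_iconv_name("windows-1255", buf, sizeof buf);`, and could then hand the result to an iconv-based converter.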
Jehan
256d1957b2 uchardet_get_charset() should never return NULL...
... to stay backward compatible with previous behavior.
About detection failure, our in-code documentation says:
"@return name of charset on success and "" on failure or pure ascii."
This behavior had been broken by commit 3a518c0, which returned NULL
instead. Our command-line tool was the first victim, segfaulting on
ASCII files.
2015-11-16 17:33:16 +01:00
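The contract quoted above (an empty string rather than NULL on failure) can be sketched with a small guard. The helper `charset_or_empty` is a hypothetical name used for illustration and is not the actual fix applied in uchardet.

```c
#include <stddef.h>

/* Hypothetical helper, not the actual uchardet code: normalize a
 * detection result so that callers always receive a valid C string. */
const char *charset_or_empty(const char *detected)
{
    return detected != NULL ? detected : "";
}
```

With such a guard, a caller can print the result with `printf("%s\n", ...)` without a NULL check. Passing NULL for `%s` is undefined behavior in C, which is consistent with the segfault reported for the command-line tool.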
Jehan
d0ccdd5db9 Release: version 0.0.2. v0.0.2 2015-11-16 15:56:45 +01:00
Carbo Kuo
016eb18437 Merge pull request #15 from wang-bin/c++abi
do not use std::string which breaks c++ abi
2015-11-09 20:04:21 +01:00
Carbo Kuo
124d99bcd7 Merge pull request #9 from Jehan/master
(void) and () empty arguments are different in C.
2015-11-09 20:01:22 +01:00
Carbo Kuo
6d562268c3 Merge pull request #13 from cicku/patch-1
Refine Description in pkgconfig file
2015-11-09 19:58:47 +01:00
wang-bin
3a518c0536 do not use std::string which breaks c++ abi
Some STL types can break the ABI. If the program is built with g++ 5 and the libstdc++ on the target platform comes from g++ 4.x, the program cannot run.
2015-11-04 18:16:24 +08:00
Christopher Meng
a55c6d26af Refine Description in pkgconfig file 2015-09-21 09:37:36 +08:00
Jehan
ba97505efc (void) and () empty arguments are different in C.
This fixes the following warning when including uchardet.h in C source,
built with -Wstrict-prototypes:
`uchardet.h:52:1: warning: function declaration isn't a prototype`
2015-09-05 15:58:56 +02:00
Carbo Kuo
84e292d1b9 Merge pull request #8 from wm4/header_fixes
Header conformance fixes
2015-08-06 12:33:03 +02:00
wm4
d59294a00e Header conformance fixes
Identifiers starting with __ are reserved for the system - user code
(including non-system libraries) must not define them.

A function which takes no parameters is declared with "(void)". In C, an
empty parameter list means that any number of parameters with
unspecified types is allowed, which is not what we want in this case.
Another reason to fix this is that compilers often warn if this legacy
feature is used, which is bothersome for API users.

Additionally, use an opaque struct as underlying type for uchardet_t.
This facilitates type-checking, as it's harder to confuse with other
types, especially in C. This is not strictly a conformance issue, but
still a nice change. Note that this is neither an API nor an ABI change.
2015-08-05 22:24:49 +02:00
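The three points in the message above can be sketched in a few lines of C. All names here (`DETECTOR_H`, `detector_t`, `detector_new`, `detector_delete`, and the struct field) are hypothetical and chosen for illustration; they are not the uchardet declarations.

```c
#include <stdlib.h>

/* Include guard without a leading double underscore, since identifiers
 * starting with __ are reserved for the implementation. */
#ifndef DETECTOR_H
#define DETECTOR_H

/* Opaque handle: users see only a pointer to an incomplete struct, so the
 * compiler rejects accidental mixing with unrelated pointer types. */
typedef struct detector *detector_t;

/* (void) declares a function that takes no parameters. An empty () in C
 * would instead leave the parameter list unspecified. */
detector_t detector_new(void);
void detector_delete(detector_t d);

#endif /* DETECTOR_H */

/* Implementation side: only this part knows the struct layout. */
struct detector {
    int bytes_seen; /* placeholder field for the sketch */
};

detector_t detector_new(void)
{
    return calloc(1, sizeof(struct detector));
}

void detector_delete(detector_t d)
{
    free(d);
}
```

In a real project the declarations would live in a header and the struct definition in a source file, which is what keeps the layout hidden from API users.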
Carbo Kuo
47316bb194 Merge pull request #5 from llloic11/master
Multiple filenames in command line
2015-07-20 15:51:52 +02:00
Loic Le Loarer
07af96b3a7 Use perror for error report 2015-07-16 01:20:03 +02:00
Loic Le Loarer
1c89a2f8ff Use stdin by default as before 2015-07-16 01:15:08 +02:00
Loic Le Loarer
972d061e90 Allow multiple filenames in the command line 2015-07-16 00:59:58 +02:00
Carbo Kuo
5653243699 Merge pull request #4 from nu774/fixes
Fixes
2015-06-22 16:09:54 +02:00
nu774
f5637b23b8 fix for MinGW build 2015-06-20 12:28:01 +09:00
nu774
ba6679f2b3 fix: export symbols were not passed to the linker as intended 2015-06-20 12:28:01 +09:00
Carbo Kuo
69b7133995 Add a link to rust-uchardet on README 2014-11-20 20:06:41 +01:00
Hoa V. DINH
56b8581a70 Added shared scheme 2014-10-26 00:24:03 -07:00
Hoa V. Dinh
e2dd66aa30 Build for Mac 2014-10-25 09:24:59 -07:00
Carbo Kuo
6caa8f6580 Add README 2013-11-08 07:02:50 +08:00
BYVoid
56a4c0d86c Add authors. 2011-07-13 20:16:23 +08:00
BYVoid
00177ab024 Add description of installation. 2011-07-11 23:14:57 +08:00
BYVoid
b60936abc5 Fix a misspelling in manpage. 2011-07-11 18:15:58 +08:00
byvoid
eaab1d7868 Set permissions. 2011-07-11 18:08:26 +08:00