Jehan dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear whether our naming followed any rules. iconv is a
widely used encoding conversion API, so we will follow its naming.
At least one returned name was invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to match the output of `iconv --list`,
although iconv is mostly case-insensitive, so the old casing should not
have been a problem. "Just in case".
Prober names can still be chosen freely (they are apparently only used
for output display).
Finally, HZ-GB-2312 is absent from my iconv list, but libiconv master
still contains this encoding under this name, so I will consider it
valid.
2015-11-17 16:15:21 +01:00
build-mac Added shared scheme 2014-10-26 00:24:03 -07:00
doc Fix a wrong spell in manpage. 2011-07-11 18:15:58 +08:00
script Add authors. 2011-07-13 20:16:23 +08:00
src uchardet_get_charset() must return iconv-compatible names. 2015-11-17 16:15:21 +01:00
test Set permissions. 2011-07-11 18:08:26 +08:00
AUTHORS Add authors. 2011-07-13 20:16:23 +08:00
CMakeLists.txt Release: version 0.0.2. 2015-11-16 15:56:45 +01:00
COPYING Add authors. 2011-07-13 20:16:23 +08:00
INSTALL Add description of installation. 2011-07-11 23:14:57 +08:00
README.md uchardet_get_charset() must return iconv-compatible names. 2015-11-17 16:15:21 +01:00
uchardet.pc.in Refine Description in pkgconfig file 2015-09-21 09:37:36 +08:00

uchardet

uchardet is a C language binding of the original C++ implementation of the universal charset detection library by Mozilla.

uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible.
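Because the returned names are iconv-compatible, a detected name can in principle be passed straight to `iconv_open()`. The following is a minimal sketch of that idea, not code from this project: `convert_to_utf8` is a hypothetical helper name, and it assumes the POSIX `iconv` API as provided by glibc or libiconv. Which names are actually accepted depends on the iconv implementation installed on the system.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of uchardet): convert `in_len` bytes of
 * `in`, encoded as `from_charset`, to UTF-8 in `out`.
 * Returns the number of bytes written, or -1 on failure. */
long convert_to_utf8(const char *from_charset,
                     const char *in, size_t in_len,
                     char *out, size_t out_cap)
{
    /* iconv_open() takes the target encoding first, then the source. */
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return -1;                 /* name unknown to this iconv */

    char *src = (char *)in;        /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                 /* invalid input or output too small */
    return (long)(out_cap - dst_left);
}
```

A caller would pass the string obtained from `uchardet_get_charset()` as `from_charset`. The sketch converts the whole buffer in one call and treats any error as fatal; real code would likely loop on `E2BIG` and handle partial input.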

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Supported Encodings

  • Unicode
    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143
  • Chinese
    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-2312
  • Japanese
    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean
    • ISO-2022-KR
    • EUC-KR
  • Cyrillic
    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MACCYRILLIC
    • IBM866
    • IBM855
  • Greek
    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew
    • ISO-8859-8
    • WINDOWS-1255
  • Others
    • WINDOWS-1252

Installation

Ubuntu/Debian

apt-get install uchardet libuchardet-dev

Mac

brew install uchardet

Build from source

cmake .
make
make install

Usage

Command Line

uchardet Command Line Tool
Version 0.0.2

Author: BYVoid
Bug Report: http://code.google.com/p/uchardet/issues/entry

Usage:
 uchardet [Options] [File]

Options:
 -v, --version         Print version and build information.
 -h, --help            Print this help.

Library

See uchardet.h for the C API.

License

Mozilla Public License Version 1.1