
uchardet

uchardet is a C language binding of the original C++ implementation of the universal charset detection library by Mozilla.

uchardet is an encoding detector library: it takes a sequence of bytes in an unknown character encoding, with no additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible.

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Supported Encodings

  • International (Unicode)
    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143
  • Chinese
    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-2312
  • Japanese
    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean
    • ISO-2022-KR
    • EUC-KR
  • Cyrillic
    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MAC-CYRILLIC
    • IBM866
    • IBM855
  • Greek
    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew
    • ISO-8859-8
    • WINDOWS-1255
  • Thai
    • TIS-620
  • French
    • ISO-8859-1
    • ISO-8859-15
  • English
    • ASCII
  • Others
    • WINDOWS-1252
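
For the Unicode entries at the top of this list, a byte order mark (BOM) at the start of the input, when present, already identifies the encoding. The following is a minimal sketch of that single check, not uchardet's actual implementation; the function name sniff_bom is hypothetical. Input without a BOM, and all the other encodings listed, need the statistical techniques described at the Mozilla link above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of uchardet: returns an encoding name
 * if buf starts with a known byte order mark, or NULL otherwise. */
const char *sniff_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32be[]  = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32le[]  = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char ucs4_2143[] = {0x00, 0x00, 0xFF, 0xFE};
    static const unsigned char ucs4_3412[] = {0xFE, 0xFF, 0x00, 0x00};
    static const unsigned char utf8[]     = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16be[]  = {0xFE, 0xFF};
    static const unsigned char utf16le[]  = {0xFF, 0xFE};

    /* Check the 4-byte marks first: the UTF-32LE mark begins with the
     * UTF-16LE mark, and the 3412 mark begins with the UTF-16BE mark. */
    if (len >= 4) {
        if (memcmp(buf, utf32be, 4) == 0)   return "UTF-32BE";
        if (memcmp(buf, utf32le, 4) == 0)   return "UTF-32LE";
        if (memcmp(buf, ucs4_2143, 4) == 0) return "X-ISO-10646-UCS-4-2143";
        if (memcmp(buf, ucs4_3412, 4) == 0) return "X-ISO-10646-UCS-4-3412";
    }
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)    return "UTF-8";
    if (len >= 2 && memcmp(buf, utf16be, 2) == 0) return "UTF-16BE";
    if (len >= 2 && memcmp(buf, utf16le, 2) == 0) return "UTF-16LE";

    return NULL; /* no BOM: the encoding cannot be decided this way */
}
```

The digits in the two X-ISO-10646-UCS-4 names give the order in which the four octets of each code unit are stored, with 1234 being big-endian.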

Installation

Debian/Ubuntu/Mint

apt-get install uchardet libuchardet-dev

Mageia

urpmi libuchardet libuchardet-devel

Mac

brew install uchardet

Build from source

cmake .
make
make install

Usage

Command Line

uchardet Command Line Tool
Version 0.0.3

Author: BYVoid
Bug Report: http://code.google.com/p/uchardet/issues/entry

Usage:
 uchardet [Options] [File]...

Options:
 -v, --version         Print version and build information.
 -h, --help            Print this help.

Library

See uchardet.h

License

Mozilla Public License Version 1.1