mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-06-15 08:26:15 +08:00

Jehan e0eec3bae8 src: give a little weight to "probable sequences". Up to now, we were only considering positive sequences, which are sequences of 2 characters which happen the most. Yet our data gather 4 categories of sequences (the last one being called "negative", since they never happened in our data). I will call the category below positive: probable sequences. They may happen, yet not often. The last category could be called "neutral". This seems to fix the detection of a user's subtitle example without breaking any of our current unit tests. Probably I should still review this whole logics more in details later.		2016-05-25 17:38:20 +02:00
build-mac	Added shared scheme	2014-10-26 00:24:03 -07:00
doc	cmake: use GNUInstallDirs cmake module	2016-03-22 01:23:03 +03:00
script	script: stupid bug on BuildLangModel.py.	2016-05-25 15:23:36 +02:00
src	src: give a little weight to "probable sequences".	2016-05-25 17:38:20 +02:00
test	cmake: hardcode less	2016-03-22 01:23:04 +03:00
.gitignore	Add a .gitignore.	2015-11-29 02:27:42 +01:00
AUTHORS	Update authors.	2015-12-03 19:44:13 +01:00
CMakeLists.txt	cmake: use lowercase suffix for debug build	2016-03-22 01:23:05 +03:00
COPYING	Add authors.	2011-07-13 20:16:23 +08:00
INSTALL	Add description of installation.	2011-07-11 23:14:57 +08:00
README.md	README: add Danish support for 3 charsets.	2016-02-19 19:11:56 +01:00
uchardet.doap	Add a DOAP file.	2016-02-21 15:19:50 +01:00
uchardet.pc.in	pkg-config: use GNUInstallDirs CMAKE_ variables in pc.in template.	2016-03-27 20:31:58 +02:00

README.md

uchardet

uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible.

uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation.

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Supported Languages/Encodings

International (Unicode)
- UTF-8
- UTF-16BE / UTF-16LE
- UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
Arabic
- ISO-8859-6
- WINDOWS-1256
Bulgarian
- ISO-8859-5
- WINDOWS-1251
Chinese
- ISO-2022-CN
- BIG5
- EUC-TW
- GB18030
- HZ-GB-2312
Danish
- ISO-8859-1
- ISO-8859-15
- WINDOWS-1252
English
- ASCII
Esperanto
- ISO-8859-3
French
- ISO-8859-1
- ISO-8859-15
- WINDOWS-1252
German
- ISO-8859-1
- WINDOWS-1252
Greek
- ISO-8859-7
- WINDOWS-1253
Hebrew
- ISO-8859-8
- WINDOWS-1255
Hungarian
- ISO-8859-2
- WINDOWS-1250
Japanese
- ISO-2022-JP
- SHIFT_JIS
- EUC-JP
Korean
- ISO-2022-KR
- EUC-KR
Russian
- ISO-8859-5
- KOI8-R
- WINDOWS-1251
- MAC-CYRILLIC
- IBM866
- IBM855
Spanish
- ISO-8859-1
- ISO-8859-15
- WINDOWS-1252
Thai
- TIS-620
- ISO-8859-11
Turkish
- ISO-8859-3
- ISO-8859-9
Vietnamese
- VISCII
- WINDOWS-1258
Others
- WINDOWS-1252
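
Since the names listed above are iconv-compatible, a detected name can in principle be passed straight to iconv_open(). The following is a minimal sketch of that idea, not code from uchardet itself: `to_utf8` is a hypothetical helper name, and the `charset` argument is assumed to be a name such as those in the list above.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of uchardet): convert `len` bytes of `in`,
 * encoded as `charset`, to UTF-8. Writes a NUL-terminated result to `out`.
 * Returns the number of UTF-8 bytes written, or -1 on any failure. */
long to_utf8(const char *charset, const char *in, size_t len,
             char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return -1;                      /* iconv does not know this name */

    char *src = (char *)in;             /* iconv() wants char **, input is not modified */
    char *dst = out;
    size_t src_left = len;
    size_t dst_left = out_size - 1;     /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                      /* invalid input or output buffer too small */

    *dst = '\0';
    return (long)(dst - out);
}
```

For instance, `to_utf8("ISO-8859-1", "caf\xe9", 4, buf, sizeof buf)` should leave the UTF-8 form of "café" in `buf`. A fixed output buffer keeps the sketch short; real code would likely grow the buffer when iconv reports E2BIG.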

Installation

Debian/Ubuntu/Mint

apt-get install uchardet libuchardet-dev

Mageia

urpmi libuchardet libuchardet-devel

Fedora

dnf install uchardet uchardet-devel

Mac

brew install uchardet

Build from source

cmake .
make
make install

Usage

Command Line

uchardet Command Line Tool
Version 0.0.5

Authors: BYVoid, Jehan
Bug Report: https://github.com/BYVoid/uchardet/issues

Usage:
 uchardet [Options] [File]...

Options:
 -v, --version         Print version and build information.
 -h, --help            Print this help.

Library

See uchardet.h

Related Projects

- python-chardet: Python port
- ruby-rchardet: Ruby port
- juniversalchardet: Java port of universalchardet
- jchardet: Java port of chardet
- nuniversalchardet: C# port of universalchardet
- nchardet: C# port of chardet
- uchardet-enhanced: a fork of Mozilla universalchardet
- rust-uchardet: Rust language binding of uchardet
- libchardet: another C/C++ API wrapping Mozilla code

License

Mozilla Public License Version 1.1