mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-08 01:36:41 +08:00
Update README.
parent
7459a4d9b3
commit
406e1d0b29
125
README.md
@@ -1,8 +1,11 @@
# uchardet

[uchardet](https://www.freedesktop.org/wiki/Software/uchardet/) is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are [iconv](https://www.gnu.org/software/libiconv/)-compatible.
[uchardet](https://www.freedesktop.org/wiki/Software/uchardet/) is an encoding and language detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text.

uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation.
* Returned encoding names are [iconv](https://www.gnu.org/software/libiconv/)-compatible.
* Returned language codes are ISO 639-1.

uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. Since then, it has learned to detect more charsets, and much more reliably than the original implementation. Moreover, it also works as a very good language detector, while still staying reasonably fast.

## Supported Languages/Encodings

@@ -11,6 +14,7 @@ uchardet started as a C language binding of the original C++ implementation of t
* UTF-16BE / UTF-16LE
* UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
* Arabic
* UTF-8
* ISO-8859-6
* WINDOWS-1256
* Bulgarian
@@ -23,6 +27,7 @@ uchardet started as a C language binding of the original C++ implementation of t
* GB18030
* HZ-GB-2312
* Croatian:
* UTF-8
* ISO-8859-2
* ISO-8859-13
* ISO-8859-16
@@ -30,25 +35,30 @@ uchardet started as a C language binding of the original C++ implementation of t
* IBM852
* MAC-CENTRALEUROPE
* Czech
* UTF-8
* Windows-1250
* ISO-8859-2
* IBM852
* MAC-CENTRALEUROPE
* Danish
* UTF-8
* ISO-8859-1
* ISO-8859-15
* WINDOWS-1252
* English
* ASCII
* Esperanto
* UTF-8
* ISO-8859-3
* Estonian
* UTF-8
* ISO-8859-4
* ISO-8859-13
* ISO-8859-13
* Windows-1252
* Windows-1257
* Finnish
* UTF-8
* ISO-8859-1
* ISO-8859-4
* ISO-8859-9
@@ -56,27 +66,36 @@ uchardet started as a C language binding of the original C++ implementation of t
* ISO-8859-15
* WINDOWS-1252
* French
* UTF-8
* ISO-8859-1
* ISO-8859-15
* WINDOWS-1252
* German
* UTF-8
* ISO-8859-1
* WINDOWS-1252
* Greek
* UTF-8
* ISO-8859-7
* WINDOWS-1253
* Hebrew
* UTF-8
* ISO-8859-8
* WINDOWS-1255
* Hindi
* UTF-8
* Hungarian:
* UTF-8
* ISO-8859-2
* WINDOWS-1250
* Irish Gaelic
* UTF-8
* ISO-8859-1
* ISO-8859-9
* ISO-8859-15
* WINDOWS-1252
* Italian
* UTF-8
* ISO-8859-1
* ISO-8859-3
* ISO-8859-9
@@ -87,19 +106,25 @@ uchardet started as a C language binding of the original C++ implementation of t
* SHIFT_JIS
* EUC-JP
* Korean
* UTF-8
* ISO-2022-KR
* EUC-KR / UHC
* Lithuanian
* Johab
* Latvian
* UTF-8
* ISO-8859-4
* ISO-8859-10
* ISO-8859-13
* Latvian
* Lithuanian
* UTF-8
* ISO-8859-4
* ISO-8859-10
* ISO-8859-13
* Maltese
* UTF-8
* ISO-8859-3
* Polish:
* UTF-8
* ISO-8859-2
* ISO-8859-13
* ISO-8859-16
@@ -107,11 +132,13 @@ uchardet started as a C language binding of the original C++ implementation of t
* IBM852
* MAC-CENTRALEUROPE
* Portuguese
* UTF-8
* ISO-8859-1
* ISO-8859-9
* ISO-8859-15
* WINDOWS-1252
* Romanian:
* UTF-8
* ISO-8859-2
* ISO-8859-16
* Windows-1250
@@ -124,33 +151,40 @@ uchardet started as a C language binding of the original C++ implementation of t
* IBM866
* IBM855
* Slovak
* UTF-8
* Windows-1250
* ISO-8859-2
* IBM852
* MAC-CENTRALEUROPE
* Slovene
* UTF-8
* ISO-8859-2
* ISO-8859-16
* Windows-1250
* IBM852
* MAC-CENTRALEUROPE
* Spanish
* UTF-8
* ISO-8859-1
* ISO-8859-15
* WINDOWS-1252
* Swedish
* UTF-8
* ISO-8859-1
* ISO-8859-4
* ISO-8859-9
* ISO-8859-15
* WINDOWS-1252
* Thai
* UTF-8
* TIS-620
* ISO-8859-11
* Turkish:
* UTF-8
* ISO-8859-3
* ISO-8859-9
* Vietnamese:
* UTF-8
* VISCII
* Windows-1258
* Others
@@ -236,9 +270,13 @@ Here is a working "module" section to include in your Flatpak's json manifest:

### Command Line

uchardet comes with a command line tool built on its own library. It
can be considered a demo of `libuchardet`, though it is also very
useful in its own right for inspecting files.

```
uchardet Command Line Tool
Version 0.0.7
Version 0.1.0

Authors: BYVoid, Jehan
Bug Report: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
@@ -249,6 +287,8 @@ Usage:
Options:
-v, --version Print version and build information.
-h, --help Print this help.
-V, --verbose Show all candidates and their confidence value.
-w, --weight Tweak language weights.
```
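
As an illustration of using the tool to inspect files from a script, here is a minimal Python sketch. It assumes that the `uchardet` executable is on `PATH` and that, when given a single file, it prints one charset name on standard output. The helper name `detect_with_cli` is hypothetical and not part of uchardet.

```python
import shutil
import subprocess


def detect_with_cli(path, tool="uchardet"):
    """Return the charset name printed by the command line tool, or None.

    Assumptions (not guaranteed by this README): the tool is on PATH and
    prints a single charset name when called with one file argument.
    """
    if shutil.which(tool) is None:
        return None  # the tool is not installed
    result = subprocess.run(
        [tool, path], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None  # the tool reported an error (e.g. unreadable file)
    # Strip the trailing newline; treat empty output as "no answer".
    return result.stdout.strip() or None
```

Calling `detect_with_cli("subtitles.srt")` would then return a name that can be passed on to an iconv-based converter, or `None` when the tool is missing or fails.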

### Library
@@ -261,25 +301,70 @@ As said in introduction, this was initially a project of Mozilla to
allow better detection of page encodings, and it used to be part of
Firefox. If I am not mistaken, this is not the case anymore (probably because
nowadays most websites better announce their encoding, and also UTF-8 is
much more widely spread).
much more widely spread) and the original code has been abandoned.

Techniques used by universalchardet are described at https://www-archive.mozilla.org/projects/intl/universalcharsetdetection

It is to be noted that a lot has changed since the original code, yet
the base concept is still around, basing detection not just on encoding
rules, but importantly on analysis of character statistics in languages.
Note that a lot has changed since the original implementation, yet the
base concept is still the same: detection is based not just on encoding
rules but, most importantly, on analysis of character statistics in
languages.

The original code by Mozilla no longer seems to be available anywhere,
but it is probably not too far from the initial commit of this repository.

Mozilla code was extracted and packaged into a standalone library under
the name `uchardet` by BYVoid in 2011, in a personal repository.
Starting 2015, I (i.e. Jehan) started contributing, "standardized"
the output to be iconv-compatible, added various encoding/language
support and streamlined generation of sources for new support of
encoding/languages by using texts from Wikipedia as statistics source on
languages through Python scripts. Then I soon became co-maintainer.
In 2016, `uchardet` became a freedesktop project.
1. Mozilla code was extracted and packaged into a standalone library under
   the name `uchardet` by BYVoid in 2011, in a personal repository.
2. Starting in 2015, I (i.e. Jehan) began contributing: I "standardized"
   the output to be iconv-compatible, added support for various
   encodings/languages, and streamlined generation of sources for new
   encoding/language support by using texts from Wikipedia as the
   statistics source on languages, through Python scripts. I soon became
   co-maintainer.
3. In 2016, `uchardet` became a freedesktop project.
4. Since 2015, the number of supported encodings has continuously
   increased; in particular version 0.0.6 (2016) and especially 0.0.7
   (2020) added a lot of new supported charset-language pairs.
5. In 2021, I added language detection support.

## Techniques used

Techniques used originally by universalchardet are described at:
https://www-archive.mozilla.org/projects/intl/universalcharsetdetection

As said in the "*History*" section, the base algorithm is still there,
helping detection of charset with analysis of character statistics in
languages.

This is also why it could evolve into a quite efficient language detector.

Furthermore, it does not use any dictionary, does no semantic analysis,
or anything of the sort. The drawback of this is that it can be wrong
sometimes, especially on very short texts (a few words) when we don't
have enough data to differentiate while a word search in a dictionary
could have done the trick. The advantages are that it makes it perform
much faster, with very small memory usage while still being extremely
good at discriminating among a lot of charsets and languages when
your text is long enough.
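
To make the idea of detection by character statistics more concrete, here is a toy Python sketch. It is not uchardet's actual algorithm or data: the single-letter frequency model, the two tiny training texts, the candidate list and every name in it (`SAMPLES`, `build_model`, `score`, `detect`) are assumptions chosen for illustration. Each candidate encoding is tried in turn, decodings that are illegal are discarded, and the remaining (encoding, language) pairs are ranked by how typical the decoded letters are for that language.

```python
import math
from collections import Counter

# Tiny training texts standing in for real corpora (illustrative only).
SAMPLES = {
    "fr": "le café est très près de l'hôtel où nous étions déjà allés "
          "cet été, peut-être jeudi",
    "el": "η γρήγορη καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί "
          "και τρέχει μέσα στο δάσος",
}
CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-7"]
UNSEEN = 1e-6  # probability assigned to letters absent from a model


def letters(text):
    """Keep only alphabetic characters, lowercased."""
    return [c for c in text.lower() if c.isalpha()]


def build_model(text):
    """Map each letter to its relative frequency in the training text."""
    counts = Counter(letters(text))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


MODELS = {lang: build_model(text) for lang, text in SAMPLES.items()}


def score(text, model):
    """Average log-probability of the letters of `text` under `model`."""
    chars = letters(text)
    if not chars:
        return float("-inf")
    return sum(math.log(model.get(c, UNSEEN)) for c in chars) / len(chars)


def detect(data):
    """Return the best (encoding, language) pair for the given bytes."""
    best = None
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the byte sequence is illegal in this encoding
        for lang, model in MODELS.items():
            s = score(text, model)
            if best is None or s > best[0]:
                best = (s, enc, lang)
    return None if best is None else (best[1], best[2])


print(detect("καλημέρα κόσμε, αυτό είναι ένα μικρό κείμενο".encode("iso-8859-7")))
print(detect("le thé est déjà prêt".encode("iso-8859-1")))
```

No dictionary is involved: the same bytes decoded with the wrong charset yield letters that are rare or absent in every language model, so the wrong pairs score poorly. With only a few letters of input the scores get close to each other, which matches the weakness on very short texts described above.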

## Supporting the project financially

I don't have a dedicated job working on uchardet, but I work
exclusively on making Free Software. In particular I develop
[GIMP](https://www.gimp.org/) and other Free Software within
the [ZeMarmot](https://film.zemarmot.net/) project.
Thus uchardet is just one of the many pieces of FLOSS code I write.

So if you want to support my Free Software code, I suggest donating to
*ZeMarmot* in one of these ways:

* Liberapay: https://liberapay.com/ZeMarmot/
* Patreon: https://www.patreon.com/zemarmot
* Tipeee: https://en.tipeee.com/zemarmot
* Other (Paypal, bank transfer…): https://film.zemarmot.net/en/donate

It might sound weird to fund a Libre Art animation film (Creative
Commons by-sa) to support the development of uchardet, but this is
exactly what happens if you do, as part of the donations goes toward my
salary. And we need more funding to continue working on Free Software
for a living.

## Related Projects

@@ -303,7 +388,7 @@ don't follow the status for these projects.
## Used by

* [mpv](https://mpv.io/) for subtitle detection
* [Tepl](https://wiki.gnome.org/Projects/Tepl)
* [Tepl](https://wiki.gnome.org/Projects/Tepl) (gedit…)
* [Nextcloud IOS app](https://github.com/nextcloud/ios)
* [Codelite](https://codelite.org)
* [QtAV](https://www.qtav.org/)