From c8a3572cca834d687b478522385530645a261d40 Mon Sep 17 00:00:00 2001
From: Jehan
Date: Wed, 29 Apr 2020 16:12:54 +0200
Subject: [PATCH] Issue #17: update README.

Replace the old link to the science paper by one on archive-mozilla
website. Remove the original source link as I can't find any archived
version of it (even on archive.org, only the folder structure is saved,
not actual files themselves, so it's useless).
Also add some history, which is probably a nice touch.
Add a link to crossroad to help people who'd want to cross-compile
uchardet.
Finally add the R binding by Artem Klevtsov and QtAV as reported.
---
 README.md | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index a2713ae..bf09091 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,6 @@ uchardet started as a C language binding of the original C++
 implementation of the universal charset detection library by Mozilla. It can
 now detect more charsets, and more reliably than the original implementation.
 
-The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
-
-Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
-
 ## Supported Languages/Encodings
 
 * International (Unicode)
@@ -194,7 +190,8 @@ to use MinGW-w64 instead of MinGW, in particular to build both 32 and 64-bit DLL
 libraries).
 
 Note also that it is very easily cross-buildable (for instance from a
-GNU/Linux machine).
+GNU/Linux machine; [crossroad](https://pypi.org/project/crossroad/) may
+help, this is what we use in our CI).
 
 ### Build from source
 
@@ -254,8 +251,41 @@ Options:
 
 See [uchardet.h](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/src/uchardet.h)
 
+## History
+
+As said in introduction, this was initially a project of Mozilla to
+allow better detection of page encodings, and it used to be part of
+Firefox. If not mistaken, this is not the case anymore (probably because
+nowadays most websites better announce their encoding, and also UTF-8 is
+much more widely spread).
+
+Techniques used by universalchardet are described at https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
+
+It is to be noted that a lot has changed since the original code, yet
+the base concept is still around, basing detection not just on encoding
+rules, but importantly on analysis of character statistics in languages.
+
+Original code by Mozilla does not seem to be found anymore anywhere, but
+it's probably not too far from the initial commit of this repository.
+
+Mozilla code was extracted and packaged into a standalone library under
+the name `uchardet` by BYVoid in 2011, in a personal repository.
+Starting 2015, I (i.e. Jehan) started contributing, "standardized"
+the output to be iconv-compatible, added various encoding/language
+support and streamlined generation of sources for new support of
+encoding/languages by using texts from Wikipedia as statistics source on
+languages through Python scripts. Then I soon became co-maintainer.
+In 2016, `uchardet` became a freedesktop project.
+
 ## Related Projects
 
+Some of these are bindings of `uchardet`, others are forks of the same
+initial code, which has diverged over time, others are native port in
+other languages.
+This list is not exhaustive and only meant as point of interest. We
+don't follow the status for these projects.
+
+ * [R-uchardet](https://cran.r-project.org/package=uchardet) R binding on CRAN
 * [python-chardet](https://github.com/chardet/chardet) Python port
 * [ruby-rchardet](http://rubyforge.org/projects/chardet/) Ruby port
 * [juniversalchardet](http://code.google.com/p/juniversalchardet/) Java port of universalchardet
@@ -272,6 +302,7 @@ See [uchardet.h](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/
 * [Tepl](https://wiki.gnome.org/Projects/Tepl)
 * [Nextcloud IOS app](https://github.com/nextcloud/ios)
 * [Codelite](https://codelite.org)
+* [QtAV](https://www.qtav.org/)
 * …
 
 ## Licenses