# Supporting new languages or updating existing ones #

We generate statistical language data using Wikipedia as a natural
language text resource.

Right now, we have automated scripts only for generating statistical
data for single-byte encodings. Multi-byte encodings usually require
more in-depth knowledge of their specifications.

## New single-byte encoding ##

Uchardet uses language data, so rather than supporting a charset on its
own, we actually support (language, charset) pairs. For instance, if
uchardet supports (French, ISO-8859-15), it should be able to recognize
French text encoded in ISO-8859-15, but it may fail to detect
ISO-8859-15 for unsupported languages.

Since many single-byte charsets share the same layout (or very similar
ones), an accurate single-byte encoding detector for arbitrary text is
simply impossible without language information. This is why we rely on
(language, charset) pairs: though less flexible, this approach makes
uchardet much more accurate than other detection systems, and it also
turns uchardet into an efficient language recognition system.

Therefore you need to describe both the language and the codepoint
layout of every charset you want to add support for.

I recommend having a look at langs/fr.py, which is heavily commented, as
a base for a new language description, and at charsets/windows-1252.py
as a base for a new charset layout (note that charset layouts can be
shared between languages; if yours is already there, you have nothing to
do). The important names in a charset file are (see the sketch after
this list):

- `name`: an iconv-compatible name.
- `charmap`: fill it with CTR (control character), SYM (symbol), NUM
             (number), LET (letter), ILL (illegal codepoint).
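
As an illustration only, here is a rough sketch of what such a charset
file could look like. The names `name` and `charmap` come from the list
above, but everything else (where the CTR/SYM/NUM/LET/ILL constants are
defined, the exact classification of each range) is an assumption: copy
an existing file such as charsets/windows-1252.py rather than starting
from this sketch.

```python
# Hypothetical sketch of a charset description, e.g. charsets/my-charset.py.
# In the real files the CTR/SYM/NUM/LET/ILL constants come from the build
# scripts; they are stubbed here only so the sketch is self-contained.
CTR, SYM, NUM, LET, ILL = range(5)

name = 'ISO-8859-15'  # an iconv-compatible name

# One entry per codepoint 0x00..0xFF. The ASCII half below is generic;
# the upper half must be filled in from your charset's actual layout.
charmap = (
    [CTR] * 0x20    # 0x00-0x1F: control characters
    + [SYM] * 0x10  # 0x20-0x2F: space and punctuation
    + [NUM] * 0x0A  # 0x30-0x39: digits 0-9
    + [SYM] * 0x07  # 0x3A-0x40: ':' to '@'
    + [LET] * 0x1A  # 0x41-0x5A: A-Z
    + [SYM] * 0x06  # 0x5B-0x60: '[' to '`'
    + [LET] * 0x1A  # 0x61-0x7A: a-z
    + [SYM] * 0x04  # 0x7B-0x7E: '{' to '~'
    + [CTR]         # 0x7F: DEL
    + [ILL] * 0x80  # 0x80-0xFF: replace with your charset's real layout
)
assert len(charmap) == 256
```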

## Tools ##

You must install Python 3 and the [`Wikipedia` Python
tool](https://github.com/goldsmith/Wikipedia).
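
Assuming the linked project is the one published on PyPI under the name
`wikipedia`, it can typically be installed with:

> pip3 install wikipedia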

## Run script ##

Let's say you added (or modified) support for French (`fr`). Then run:

> ./BuildLangModel.py fr --max-page=100 --max-depth=4

Both options can be set to any value. Bigger values mean the script will
process more data: more processing time now, but a potentially more
accurate uchardet in the end.
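
The script also accepts a list of languages rather than one at a time,
as well as two special codes: `none`, which retrains no language model
and only regenerates the generated listings, and `all`, which retrains
every model (useful in particular when the model format or usage changed
and everything must be regenerated). For example:

> ./BuildLangModel.py fr sr ru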

## Updating core code ##

If you were only updating data for a language model, you have nothing
else to do. Just build `uchardet` again and test it.

If you created new models though, you will have to add them to
src/nsSBCSGroupProber.cpp and src/nsSBCharSetProber.h, and increase the
value of `NUM_OF_SBCS_PROBERS` in src/nsSBCSGroupProber.h.
Finally add the new file to src/CMakeLists.txt.
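
Note that recent versions of BuildLangModel.py end by printing a TODO
list of exactly these remaining manual steps (two functions to edit and
the CMakeLists), so you can use the script output as a checklist.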

I will be looking to make this step more straightforward in the future.