# Supporting new or updating languages #

We generate statistical language data using Wikipedia as a natural
language text resource.

Right now, we only have automated scripts to generate statistical data
for single-byte encodings. Multi-byte encodings usually require more
in-depth knowledge of their specifications.

## New single-byte encoding ##

Uchardet uses language data, and therefore rather than supporting a
charset on its own, we in fact support (language, charset) pairs. For
instance, if uchardet supports (French, ISO-8859-15), it should be able
to recognize French text encoded in ISO-8859-15, but may fail at
detecting ISO-8859-15 for non-supported languages.

Though less flexible, this approach makes uchardet much more accurate
than other detection systems, and it also makes uchardet an efficient
language recognition system.

Since many single-byte charsets share the same layout (or very similar
ones), it is simply impossible to build an accurate single-byte encoding
detector for arbitrary text; this is why the language data is needed.

Therefore you need to describe the language and the codepoint layouts of
every charset you want to add support for.

I recommend having a look at `langs/fr.py`, which is heavily commented,
as a base for a new language description, and at
`charsets/windows-1252.py` as a base for a new charset layout (note that
charset layouts can be shared between languages; if yours is already
there, you have nothing to do).
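
To give a rough idea before going into the charset details, here is a
purely illustrative Python sketch of a language description. Every field
name and value below is hypothetical and chosen only for the example;
the real, documented set of fields is the one you will find in
`langs/fr.py`.

```python
# Hypothetical language description sketch (e.g. langs/xx.py).
# All field names here are illustrative; check langs/fr.py for the
# actual fields expected by the build scripts.

# Human-readable language name (illustrative field name).
name = 'French'

# Wikipedia language code used to fetch natural language text
# (illustrative field name).
wikipedia_code = 'fr'

# Charsets for which (language, charset) pairs should be generated;
# each one needs a matching layout under charsets/ (illustrative field
# name).
charsets = ['ISO-8859-1', 'ISO-8859-15', 'WINDOWS-1252']

# Letters of the language, lowercase (illustrative field name and value).
alphabet = 'abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ'
```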

The important names in the charset file are:

- `name`: an iconv-compatible name.
- `charmap`: fill it with CTR (control character), SYM (symbol), NUM
  (number), LET (letter), ILL (illegal codepoint).
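
Purely as an illustration, here is a rough Python sketch of what such a
charset description could look like. The constant values and the partial
`charmap` below are assumptions made for this example; the authoritative
reference for the exact structure expected by the scripts remains
`charsets/windows-1252.py`.

```python
# Hypothetical charset layout sketch (e.g. charsets/my-charset.py).
# In the real scripts the CTR/SYM/NUM/LET/ILL constants are provided by
# the build tooling; they are redefined here only to keep this sketch
# self-contained and runnable.
CTR = 0  # control character
SYM = 1  # symbol
NUM = 2  # number
LET = 3  # letter
ILL = 4  # illegal codepoint

# iconv-compatible name of the charset.
name = 'ISO-8859-15'

# One entry per codepoint (0x00..0xFF). Only the ASCII range is filled
# in here; a real layout classifies all 256 codepoints, marking
# unassigned ones as ILL.
charmap = (
    [CTR] * 32    # 0x00-0x1F: control characters
    + [SYM] * 16  # 0x20-0x2F: space and punctuation
    + [NUM] * 10  # 0x30-0x39: digits
    + [SYM] * 7   # 0x3A-0x40: punctuation
    + [LET] * 26  # 0x41-0x5A: A-Z
    + [SYM] * 6   # 0x5B-0x60: punctuation
    + [LET] * 26  # 0x61-0x7A: a-z
    + [SYM] * 4   # 0x7B-0x7E: punctuation
    + [CTR]       # 0x7F: DEL
)
```
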
## Tools ##

You must install Python 3 and the [`Wikipedia` Python
tool](https://github.com/goldsmith/Wikipedia).

If the requirements change, they will be updated in `requirements.txt`,
so that you can just run `pip3 install -r requirements.txt`.

## Run script ##

Let's say you added (or modified) support for French (`fr`). Then run:

> ./BuildLangModel.py fr --max-page=200 --max-depth=4

The options can be changed to any value. Bigger values mean the script
processes more data, which takes more time, but may make uchardet more
accurate in the end.
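
For example, if you are willing to spend more processing time on a
larger amount of Wikipedia content, you could run something like (the
values below are arbitrary):

> ./BuildLangModel.py fr --max-page=1000 --max-depth=6
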
## Updating core code ##

If you were only updating data for an existing language model, you have
nothing else to do. Just build `uchardet` again and test it.

If you were creating new models, though, you will have to add the
sequence models in `src/nsSBCSGroupProber.cpp` and the language model in
`src/nsMBCSGroupProber.cpp`. Finally, add the new file in
`src/CMakeLists.txt`.

I will be looking to make this step more straightforward in the future.