uchardet

coffee/uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-06 16:56:40 +08:00

Commit Graph

Author	SHA1	Message	Date

Jehan

db836fad63

script, src: generate more code for language and sequence model listing.

Right now, each time we add a new language or new charset support, there
are too many pieces of code that we must not forget to edit. The script
script/BuildLangModel.py will now take care of the main parts: listing
the sequence models, listing the generic language models and computing
the numbers for each listing.

Furthermore the script will now end with a TODO list of the parts which
are still to be done manually (2 functions to edit and a CMakeLists).

Finally the script now accepts a list of languages to edit rather
than having to be run with languages one by one. It also accepts 2
special codes: "none", which retrains none of the languages but
regenerates only the generated listings; and "all", which retrains
all models (useful in particular when we change the model formats or
usage and want to regenerate everything).

2022-12-18 17:23:34 +01:00
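
The language selection described in this commit message can be illustrated with a minimal sketch. This is not the code from script/BuildLangModel.py: the function name `resolve_languages` and its arguments are hypothetical, and only the meaning of "none" and "all" is taken from the message.

```python
def resolve_languages(requested, available):
    """Return the languages to retrain for a request (hypothetical helper).

    requested: list of language codes given by the user, or the special
               single-item lists ["none"] or ["all"].
    available: set of language codes for which training data exists.
    """
    if requested == ["none"]:
        # Retrain nothing; the caller would only regenerate the listings.
        return []
    if requested == ["all"]:
        # Retrain every known model, in a stable order.
        return sorted(available)
    unknown = [lang for lang in requested if lang not in available]
    if unknown:
        raise ValueError("unknown language(s): " + ", ".join(unknown))
    return list(requested)
```

With this shape, regenerating the listings is a step that always runs, and the returned list only controls which models are retrained first.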

Jehan

b70b1ebf88

Rebuild a bunch of language models.

Add a generic language model (see coming commit), which uses the same
data as the specific single-byte encoding statistics model, except that
it applies it to Unicode code points.
For this to work, instead of the CharToOrderMap, which mapped directly
from encoded byte (always 256 values) to order, we now add an array of
frequent characters, sorted by Unicode code point, which maps each
character to its order of frequency (and can be used with the same
sequence mapping array).

This of course means that each prober where we want to use these
generic models will have to implement its own byte to code point
decoder, as this is per-encoding logic anyway. This will come in a
subsequent commit.

2022-12-14 00:23:13 +01:00
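
The code point to order mapping described in this commit message can be sketched as follows. This is an illustration under assumptions, not uchardet's generated code: the table contents, the names `UNICODE_CHAR_TO_ORDER`, `order_of` and `orders_from_bytes`, and the use of binary search are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical table: (code point, frequency order) pairs sorted by code
# point. Order 0 is the most frequent character. Values are made up.
UNICODE_CHAR_TO_ORDER = [
    (ord("a"), 1),
    (ord("e"), 0),
    (ord("t"), 2),
    (ord("é"), 3),
]
CODE_POINTS = [cp for cp, _ in UNICODE_CHAR_TO_ORDER]


def order_of(code_point):
    """Return the frequency order of a code point, or None if not listed."""
    i = bisect_left(CODE_POINTS, code_point)
    if i < len(CODE_POINTS) and CODE_POINTS[i] == code_point:
        return UNICODE_CHAR_TO_ORDER[i][1]
    return None


def orders_from_bytes(data, encoding):
    """Decode bytes with a per-encoding decoder, then map to orders."""
    return [order_of(ord(ch)) for ch in data.decode(encoding)]
```

Because the lookup is keyed on code points rather than encoded bytes, the same table serves any encoding once a decoder exists: in this sketch, "été" encoded as ISO-8859-1 and as UTF-8 yields the same list of orders.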

Jehan

290fbd2e2e

BuildLangModel: add the licensing header to generated files.

2015-11-29 02:26:33 +01:00

3 Commits