uchardet

coffee/uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-06 16:56:40 +08:00

Commit Graph

Author	SHA1	Message	Date
Jehan	7875272a8c	script, src, test: new Georgian support. For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-* sets were generated thanks to the new create-table.py script. Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.	2022-12-20 14:28:29 +01:00
Jehan	d40e5868d5	script, src, test: adding Catalan support. For UTF-8, ISO-8859-1 and WINDOWS-1252. The test for UTF-8 and ISO-8859-1 is taken from the 'Marmota' page on Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the 'Unió_Europea' page. Since ISO-8859-1 and WINDOWS-1252 are very similar for most letters (in particular the ones used in Catalan), I differentiated the test with a text containing the '€' symbol, which is on an unused spot in ISO-8859-1.	2022-12-20 01:46:15 +01:00
Jehan	db836fad63	script, src: generate more code for language and sequence model listing. Right now, each time we add a new language or new charset support, there are too many pieces of code that we must remember to edit. The script script/BuildLangModel.py will now take care of the main parts: listing the sequence models, listing the generic language models and computing the numbers for each listing. Furthermore, the script will now end with a TODO list of the parts which are still to be done manually (2 functions to edit and a CMakeLists). Finally, the script now accepts a list of languages to edit instead of having to be run on languages one by one. It also accepts 2 special codes: "none", which retrains none of the languages and only regenerates the generated listings; and "all", which retrains all models (useful in particular when we change the model formats or usage and want to regenerate everything).	2022-12-18 17:23:34 +01:00
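
The Catalan commit (d40e5868d5) relies on '€' to tell WINDOWS-1252 apart from ISO-8859-1. A minimal sketch of why that works, using only Python's built-in codecs: this is an illustration, not code from the uchardet repository, and the variable names are made up for the example.

```python
# Illustrative only: not part of uchardet. Uses Python's standard codecs.
EURO = "€"

# WINDOWS-1252 assigns the euro sign to byte 0x80.
cp1252_bytes = EURO.encode("cp1252")

# ISO-8859-1 has no euro sign, so encoding it fails.
try:
    EURO.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

# Accented letters used in Catalan map to the same bytes in both charsets,
# which is why a text without '€' cannot distinguish them.
sample = "Unió Europea"
same_bytes = sample.encode("cp1252") == sample.encode("iso-8859-1")

print(cp1252_bytes, latin1_can_encode, same_bytes)
```

A test text made only of letters would be byte-identical in both encodings, so a byte such as 0x80 is what gives a detector something to separate them with.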

3 Commits