uchardet

coffee/uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-24 12:44:46 +08:00

Commit Graph

Author

SHA1

Message

Date

Jehan

ffb94e4a9d

script, src, test: Bulgarian language models added.

Not sure why we had Bulgarian support but never updated it with the
model generation script (or so it seems), especially now that generic
language models make UTF-8/Bulgarian support possible. Maybe I tested
it some time ago and it was getting bad results? Anyway, with all the
recent updates to the confidence computation, I now get very good
detection scores.

So adding support for UTF-8/Bulgarian and rebuilding other models too.

Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).

The 2 new test files contain text from the 'Мармоти' page on the
Bulgarian-language Wikipedia.

2022-12-17 18:41:00 +01:00
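
The commit above names two encodings for Bulgarian text, UTF-8 and ISO-8859-5. As a minimal sketch of why a detector has to tell them apart, the snippet below encodes the same word both ways using only the Python standard library. It does not call uchardet and says nothing about how uchardet's models work; the `SAMPLE` constant and the `encode_variants` helper are hypothetical names chosen for illustration.

```python
# Sample text: the Wikipedia page title mentioned in the commit message.
SAMPLE = "Мармоти"


def encode_variants(text):
    """Return the bytes of `text` in the two encodings named in the commit."""
    return {
        "UTF-8": text.encode("utf-8"),
        "ISO-8859-5": text.encode("iso8859_5"),
    }


variants = encode_variants(SAMPLE)
for name, raw in variants.items():
    # Same characters, different byte sequences and lengths.
    print(f"{name}: {len(raw)} bytes -> {raw.hex(' ')}")
```

Each Cyrillic letter takes two bytes in UTF-8 and one byte in ISO-8859-5, so the same text yields quite different byte streams depending on the encoding.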

Jehan

0efcdfa546

Reorganize test files in language subdirectories.

I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account, and for any encoding used by several
languages we should create test files in each of those languages.

2015-11-17 21:12:39 +01:00
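
One way a test harness could use language subdirectories is sketched below. It assumes a layout of the form `test/<language>/<charset>.txt`; that layout, the example paths, and the `expected_from_path` helper are assumptions made for illustration, not something stated in the commit message.

```python
from pathlib import PurePosixPath


def expected_from_path(path):
    """Derive (language, charset) from a path such as test/bg/iso-8859-5.txt.

    Assumes the parent directory names the language and the file name
    (without its extension) names the expected charset.
    """
    p = PurePosixPath(path)
    return p.parent.name, p.stem


# Hypothetical test files, one per (language, encoding) pair.
for path in ["test/bg/iso-8859-5.txt", "test/bg/utf-8.txt"]:
    language, charset = expected_from_path(path)
    print(f"{path}: language={language}, expected charset={charset}")
```

Keeping the language in the path lets a single encoding appear under several languages, so each pairing can be checked separately.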

2 Commits