uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-01-01 03:12:24 +08:00

Author	SHA1	Message	Date
Jehan	ffb94e4a9d	script, src, test: Bulgarian language models added. Not sure why we had Bulgarian support but haven't updated it recently (i.e. never with the model generation script, or so it seems), especially with generic language models, which allow UTF-8/Bulgarian support. Maybe I tested it some time ago and it was getting bad results? Anyway, with all the recent updates to the confidence computation, I now get very good detection scores. So adding support for UTF-8/Bulgarian and rebuilding other models too. Also adding a test for ISO-8859-5/Bulgarian (we already had support, but no test files). The 2 new test files are text from the page 'Мармоти' on Bulgarian Wikipedia.	2022-12-17 18:41:00 +01:00
Jehan	5a949265d5	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A downstream project could easily get the English language name from the XML/JSON files provided by the iso-codes project. That project also makes it easy to localize the language name into other languages through gettext (this is what we do in GIMP, for instance). I don't add any dependency, though, and leave this to downstream projects to implement. I also wondered whether we want to support region information in cases where it would make sense, especially for Chinese encodings, as some of them seem quite specific to a region (according to Wikipedia at least). For the time being, though, these just return "zh". We'll see later whether it makes sense to be more accurate (maybe depending on reports?).	2022-12-14 00:23:13 +01:00
Jehan	d686fcc1cd	LangModels: add illegal codepoints information on single byte charmaps.	2015-12-03 19:04:07 +01:00
Jehan	dbb4c1d2ff	nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE with a per-language model "frequent character" count.	2015-11-29 23:51:55 +01:00
Jehan	2106173546	Move all Single-Byte language models to a subdirectory.	2015-11-27 23:11:23 +01:00

5 Commits
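
The message for commit 5a949265d5 leaves it to downstream projects to turn the returned ISO 639-1 code into a readable name, and notes that some probers return NULL. A minimal sketch of that downstream step follows. It does not call uchardet itself: the `ISO_639_1_NAMES` table and the `language_display_name` helper are hypothetical names introduced for illustration, and the table is a tiny hand-written excerpt standing in for the data a real project would load from iso-codes.

```python
from typing import Optional

# Hypothetical excerpt for illustration only. A real project would load
# the full code-to-name mapping from the iso-codes data files instead.
ISO_639_1_NAMES = {
    "bg": "Bulgarian",
    "zh": "Chinese",
    "fr": "French",
}


def language_display_name(code: Optional[str]) -> str:
    """Map a 2-character ISO 639-1 code to an English display name.

    `code` is None when the detector reported no language (the NULL
    case described in the commit message).
    """
    if code is None:
        return "Unknown"
    # Fall back to the raw code when the excerpt has no entry for it.
    return ISO_639_1_NAMES.get(code.lower(), code)


if __name__ == "__main__":
    for detected in ("bg", "zh", None, "xx"):
        print(repr(detected), "->", language_display_name(detected))
```

Falling back to the raw code rather than raising keeps the caller usable when the detector reports a language that the lookup table does not cover.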