24 Commits

Author SHA1 Message Date
Jehan
0fffc109b5 script, src, test: adding Belarusian support.
Support for UTF-8, Windows-1251 and ISO-8859-5.
The test content comes from the page 'Суркі' on the Belarusian Wikipedia.
2022-12-17 19:13:03 +01:00
Jehan
ffb94e4a9d script, src, test: Bulgarian language models added.
Not sure why we had Bulgarian support but never updated it recently
(i.e. apparently never with the model generation script), especially
with generic language models, which allow UTF-8/Bulgarian support.
Maybe I tested it some time ago and it was getting bad results? Anyway,
with all the recent updates to the confidence computation, I now get
very good detection scores.

So adding support for UTF-8/Bulgarian and rebuilding other models too.

Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).

The 2 new test files are text from the page 'Мармоти' on the Bulgarian
Wikipedia.
2022-12-17 18:41:00 +01:00
Jehan
4f35cd4416 src: when checking for candidates, make sure we don't have any
unprocessed language data left.
2022-12-14 08:39:49 +01:00
Jehan
baeefc0958 src: process pending language data when we are going to pass buffer size.
We were experiencing segmentation faults when processing long texts
because we ended up trying to access out-of-range data (from
codePointBuffer). Now we check when this is about to happen and process
pending data to reset the index before adding more code points.
2022-12-14 00:24:53 +01:00
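The fix described in this commit can be sketched roughly as follows. This is a minimal illustration of the idea, not uchardet's actual code: the names `LangDetector`, `kBufferSize`, `AddCodePoint` and `ProcessPending` are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch: flush pending language data before the code-point buffer
// index runs past its capacity (the cause of the segfault on long texts).
class LangDetector {
public:
  static constexpr size_t kBufferSize = 1024;

  void AddCodePoint(uint32_t cp) {
    // The bug: appending without this check let mIndex grow past
    // kBufferSize on long texts, accessing out-of-range data.
    if (mIndex >= kBufferSize)
      ProcessPending();            // consume buffered data, reset the index
    mBuffer[mIndex++] = cp;
  }

  size_t pending() const { return mIndex; }

private:
  void ProcessPending() {
    // ...update language statistics from mBuffer[0..mIndex)...
    mIndex = 0;                    // reset so new code points fit again
  }

  uint32_t mBuffer[kBufferSize];
  size_t mIndex = 0;
};
```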
Jehan
0be80a21db script, src: update Norwegian model with the new language features.
As I just rebased my branch about the new language detection API, I
needed to re-generate the Norwegian language models. Unfortunately it
doesn't detect UTF-8 Norwegian text, though it is not far off (Norwegian
comes second with a high 91% confidence, beaten by Danish UTF-8 with 94%
confidence, unfortunately!).

Note that I also updated the alphabet list for Norwegian, as there were
too many letters in there (according to Wikipedia at least), so even
when training a model, we had some missing characters in the training
set.
2022-12-14 00:24:53 +01:00
Jehan
bfa4b10d4d script, src: add English language model.
English detection is still quite poor, so I am not adding a unit test
yet. Though I believe the bad detection is mostly caused by the
shortcuts we take to go "fast". I should probably review this whole
part of the logic as well.
2022-12-14 00:24:53 +01:00
Jehan
b7acffc806 script, src: remove generated statistics data for Korean. 2022-12-14 00:24:53 +01:00
Jehan
b725c0b2ff src: new nsCJKDetector specifically for Chinese/Japanese/Korean recognition.
I was pondering improving the logic of the LanguageModel contents in
order to better handle languages with a huge number of characters (far
too many to keep a full frequency list while keeping reasonable memory
consumption and speed).
But then I realized that this happens for languages which have their
own set of characters anyway.

For instance, modern Korean is nearly all hangul. Of course, we can find
some Chinese characters here and there, but nothing which should really
break confidence if we base it on the hangul ratio. Of course, if some
day we want to go further and detect older Korean, we will have to
improve the logic a bit with some statistics, though I wonder if
limiting ourselves to character frequency would be enough here (sequence
frequency may be overkill). To be tested.
In any case, this new class gives much more relevant confidence on
Korean texts, compared to the statistics data we previously generated.

Japanese is a mix of kana and Chinese characters. A modern full text
cannot exist without a lot of kana (probably only old texts or very
short texts, such as titles, could contain only Chinese characters). We
would still want to add a bit of statistics to correctly differentiate
a Japanese text with a lot of Chinese characters in it from a Chinese
text which quotes a few Japanese phrases. It will have to be improved,
but for now it works fairly well.

A last case where we might want to play with statistics is if we want
to differentiate between regional variants: for instance, Simplified
Chinese, Taiwan or Hong Kong Chinese… More to experiment with later on.
It's already a good first step for UTF-8 support with language
detection!
2022-12-14 00:24:53 +01:00
Jehan
a1b186fa8b src: add Hindi/UTF-8 support. 2022-12-14 00:23:13 +01:00
Jehan
629bc879f3 script, src: add generic Korean model.
Until now, Korean charsets had their own probers, as there is no
single-byte encoding for writing Korean. I have now added a Korean
model only for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language's script has a lot
of characters (a full sequence-combination array would simply be too
much data). It's not perfect yet: for instance, our UTF-8 Korean test
file ends up with a confidence of 0.38503, which is low for obviously
Korean text. Still, it works (correctly detected, with top confidence
compared to the others) and is a first step toward further improving
detection confidence.
2022-12-14 00:23:13 +01:00
Jehan
0d152ff430 src, test: fix the new Johab prober and add a test.
This prober comes from MR !1 on the main branch, though it was too
aggressive then and could not get merged. On the improved API branch,
it no longer detects the other tests as Johab.

Also fixing it to work with the new API.

Finally adding a Johab/ko unit test.
2022-12-14 00:23:13 +01:00
Jehan
3996b9d648 src: build new charset prober for Johab Korean.
The CMake build was incomplete and the enum nsSMState disappeared in
commit 53f7ad0.
Also fixing a few coding-style bugs.

See discussion in MR !1.
2022-12-14 00:23:13 +01:00
LSY
d72a5c88ce add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
Jehan
ded948ce15 script, src: generate the Hebrew models.
The Hebrew model had never been regenerated by my scripts. I have now
added the base generation files.

Note that I added 2 charsets, ISO-8859-8 and WINDOWS-1255, but they are
nearly identical. One of the differences is that the generic currency
sign is replaced by the sheqel sign (Israel's currency) in Windows-1255.
And though Windows-1255 lost the "double low line", apparently some
Yiddish characters were added. Basically, it looks like most Hebrew text
would work fine with the same confidence on both charsets, and detecting
both is likely irrelevant. So I keep the charset file for ISO-8859-8 but
won't actually use it.

The good part is that Hebrew is now also recognized in UTF-8 text,
thanks to the new code and the newly generated language model.
Jehan
6138d9e0f0 src: make nsMBCSGroupProber report all valid candidates.
Returning only the best one has limits, as it doesn't allow checking
candidates with very close confidence. In particular, the UTF-8 prober
will now return all ("UTF-8", lang) candidates for every language with
a probable statistical fit.
2022-12-14 00:23:13 +01:00
Jehan
2127f4fc0d src: allow for nsCharSetProber to return several candidates.
No functional change yet, because all probers still return 1 candidate.
But we now add a GetCandidates() method which returns the number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a
parameter, the candidate index (which must be below the value returned
by GetCandidates()). We can now consider that nsCharSetProber computes
(charset, language) couples and that the confidence applies to this
specific couple, not just to charset detection.
2022-12-14 00:23:13 +01:00
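The interface change described in this commit can be sketched as below. Only the four methods named in the commit message come from the source; the `Candidate` struct, `AddCandidate` and the member names are illustrative assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One (charset, language) couple with its own confidence.
struct Candidate {
  std::string charset;
  std::string language;
  float confidence;
};

// Sketch of a prober exposing several candidates instead of one.
class CharSetProber {
public:
  void AddCandidate(Candidate c) { mCandidates.push_back(std::move(c)); }

  // Number of candidates; valid indices are [0, GetCandidates()).
  int GetCandidates() const { return static_cast<int>(mCandidates.size()); }

  const char* GetCharSetName(int i) const { return mCandidates[i].charset.c_str(); }
  const char* GetLanguage(int i) const { return mCandidates[i].language.c_str(); }
  float GetConfidence(int i) const { return mCandidates[i].confidence; }

private:
  std::vector<Candidate> mCandidates;
};
```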
Jehan
ea32980273 src: nsMBCSGroupProber confidence weighed by language confidence.
Since our whole charset detection logic is based on text having meaning
(using actual language statistics), just because a text is valid UTF-8
does not mean it is absolutely the right encoding. It may also fit
another encoding, possibly with very high statistical confidence (and
therefore be a better candidate).
Therefore, instead of just returning 0.99 or other high values, let's
weigh our encoding confidence with the best language confidence.
2022-12-14 00:23:13 +01:00
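The weighting this commit describes amounts to a one-line formula; a sketch is below. The 0.99 base value comes from the commit message, while the function name and the plain multiplication are assumptions about how the weighting might look, not uchardet's exact code.

```cpp
#include <cassert>

// Sketch: scale the flat "valid UTF-8" confidence by the best language
// confidence, so meaningless-but-valid UTF-8 no longer wins outright.
static float WeighedUtf8Confidence(float bestLanguageConfidence) {
  const float kValidUtf8Confidence = 0.99f;  // former flat return value
  return kValidUtf8Confidence * bestLanguageConfidence;
}
```

A text with a strong language fit keeps a confidence near 0.99, while valid UTF-8 with no recognized language drops low enough for a single-byte candidate with good statistics to beat it.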
Jehan
82c1d2b25e src: reset language detectors when resetting a nsMBCSGroupProber. 2022-12-14 00:23:13 +01:00
Jehan
eb8308d50a src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with
UTF-8 for all 26 supported languages which had single-byte encoding
support until now.
2022-12-14 00:23:13 +01:00
Jehan
5257fc1abf Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober does not only detect valid UTF-8; it also detects
the most probable language. Using the data generated 2 commits ago,
this works very well.

This is still basic and will require even more improvements. In
particular, nsUTF8Prober should return an array of ("UTF-8", language)
couple candidates, and nsMBCSGroupProber should itself forward these
candidates as well as candidates from the other multi-byte detectors.
This way, the public-facing API would get more probable candidates, in
case the algorithm is slightly wrong.

Also, the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+language is valid, and low enough to
give other good candidate couples a chance).
2022-12-14 00:23:13 +01:00
Jehan
5a949265d5 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252, which will return
NULL. It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
project using this could easily get the English language name from the
XML/JSON files provided by the iso-codes project, which also allows
easily localizing the language name in other languages through gettext
(this is what we do in GIMP for instance). I am not adding any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
where it would make sense. I especially wondered about it for Chinese
encodings, as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2022-12-14 00:23:13 +01:00
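A downstream project consuming the ISO 639-1 code could map it to an English name along these lines. The commit leaves this to downstream, so this sketch stands in for the iso-codes data with a hard-coded table; the function name and the table contents are illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of a downstream lookup: ISO 639-1 code -> English name.
// A real project would load this table from the iso-codes project's
// XML/JSON data instead of hard-coding it.
static std::string LanguageName(const std::string& iso639_1) {
  static const std::map<std::string, std::string> kNames = {
    {"he", "Hebrew"}, {"ko", "Korean"}, {"no", "Norwegian"}, {"zh", "Chinese"},
  };
  auto it = kNames.find(iso639_1);
  return it == kNames.end() ? "unknown" : it->second;
}
```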
Jehan
dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear whether our naming followed any kind of rule. Since
iconv is a widely used encoding conversion API, we will follow its
naming.
At least one returned name was invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to follow the naming from
`iconv --list`, though iconv is mostly case-insensitive, so it should
not have been a problem. "Just in case".
Prober names can still be chosen freely (they are apparently only used
for display).
Finally, HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding with this name in libiconv's master code, so I will
consider it valid.
2015-11-17 16:15:21 +01:00
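One way to check that a returned name is iconv-compatible is to ask iconv itself: `iconv_open()` returns `(iconv_t)-1` when it does not know an encoding. The helper below is a sketch of such a check; as the commit notes, which names are accepted still depends on the local iconv implementation (e.g. HZ-GB-2312 is missing from some lists).

```cpp
#include <cassert>
#include <iconv.h>

// Returns true if the local iconv implementation accepts this encoding
// name as a conversion source (here, converting toward UTF-8).
static bool IconvKnows(const char* name) {
  iconv_t cd = iconv_open("UTF-8", name);
  if (cd == (iconv_t)-1)
    return false;  // iconv does not recognize this encoding name
  iconv_close(cd);
  return true;
}
```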
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00