18 Commits

Author SHA1 Message Date
Jehan
4ef378ce2e script, src: remove generated statistics data for Korean. 2021-03-20 22:59:52 +01:00
Jehan
310e750abd src: new nsCJKDetector specifically for Chinese/Japanese/Korean recognition.
I was pondering improving the logic of the LanguageModel contents, in
order to better handle languages with a huge number of characters (far
too many to keep a full frequency list while keeping reasonable memory
consumption and speed).
But then I realized that this only happens for languages which have
their own set of characters anyway.

For instance, modern Korean is nearly all hangul. Of course, we can find
some Chinese characters here and there, but nothing which should really
break confidence if we base it on the hangul ratio. If some day we want
to go further and detect older Korean, we will have to improve the logic
a bit with some statistics, though I wonder whether character frequency
alone would be enough here (sequence frequency is maybe overkill). To be
tested.
In any case, this new class gives much more relevant confidence on
Korean texts, compared to the statistical data we previously generated.

For Japanese, it is a mix of kana and Chinese characters. A modern full
text cannot exist without a lot of kana (probably only old texts or very
short texts, such as titles, could contain only Chinese characters). We
would still want to add a bit of statistics to correctly differentiate a
Japanese text containing many Chinese characters from a Chinese text
which quotes a few Japanese phrases. It will have to be improved, but
for now it works fairly well.

One last case where we would want to play with statistics might be if we
want to differentiate between regional variants: for instance,
Simplified Chinese, Taiwan or Hong Kong Chinese… More to experiment with
later on. It's already a good first step for UTF-8 support with language
detection!
2021-03-20 22:43:36 +01:00
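
As an illustration of the hangul-ratio idea described in this commit (a
minimal sketch; the helper names and the simple ratio with no threshold
are my own assumptions, not the actual nsCJKDetector code):

    #include <vector>
    #include <cstdint>

    // Illustrative only: the real nsCJKDetector internals may differ.
    static bool IsHangul(char32_t cp)
    {
      return (cp >= 0xAC00 && cp <= 0xD7A3)   // Hangul syllables
          || (cp >= 0x1100 && cp <= 0x11FF);  // Hangul jamo
    }

    // Confidence for Korean: ratio of hangul among non-ASCII code points.
    static float KoreanConfidence(const std::vector<char32_t>& codePoints)
    {
      size_t hangul = 0, letters = 0;
      for (char32_t cp : codePoints)
        {
          if (cp < 0x80)
            continue;            // ignore ASCII (punctuation, markup, etc.)
          ++letters;
          if (IsHangul(cp))
            ++hangul;
        }
      return letters ? (float) hangul / (float) letters : 0.0f;
    }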
Jehan
0729dfa974 src: add Hindi/UTF-8 support. 2021-03-19 22:36:30 +01:00
Jehan
36fd02fc2d script, src: add generic Korean model.
Until now, Korean charsets had their own probers, as there is no
single-byte encoding for writing Korean. I now added a Korean model used
only for the generic character and sequence statistics.

I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language's script has a lot
of characters (so we can't have a full sequence combination array; that
would just be too much data). It's not perfect yet. For instance, our
UTF-8 Korean test file ends up with a confidence of 0.38503, which is
low for obvious Korean text. Still, it works (correctly detected, with
top confidence compared to others) and is a first step toward further
improving detection confidence.
2021-03-19 16:48:16 +01:00
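
A rough sketch of how a truncated frequency list can still yield a
usable confidence; this is my reading of the approach, not the generated
model itself, and all names are illustrative:

    #include <unordered_set>
    #include <vector>

    // Hypothetical sketch: ratio of input code points that belong to the
    // language's N most frequent characters. A full sequence combination
    // array is avoided; only a bounded character set is stored.
    static float FrequentCharCoverage(const std::vector<char32_t>& text,
                                      const std::unordered_set<char32_t>& frequentChars)
    {
      size_t relevant = 0, matched = 0;
      for (char32_t cp : text)
        {
          if (cp < 0x80)
            continue;                       // skip ASCII
          ++relevant;
          if (frequentChars.count(cp))
            ++matched;
        }
      return relevant ? (float) matched / (float) relevant : 0.0f;
    }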
Jehan
ccb5d40a6f src, test: fix the new Johab prober and add a test.
This prober comes from MR !1 on the main branch, though it was too
aggressive then and could not get merged. On the improved API branch, it
doesn't detect other tests as Johab anymore.

Also fixing it to work with the new API.

Finally adding a Johab/ko unit test.
2021-03-18 00:26:49 +01:00
Jehan
b1f6c88792 src: build new charset prober for Johab Korean.
The CMake build was not complete and the nsSMState enum disappeared in
commit 53f7ad0.
Also fixing a few coding style issues.

See discussion in MR !1.
2021-03-17 23:48:20 +01:00
LSY
417013219c add charset prober for Johab Korean 2021-03-17 23:48:11 +01:00
Jehan
71ca5a7cd5 script, src: generate the Hebrew models.
The Hebrew model had never been regenerated by my scripts. I now added
the base generation files.

Note that I added 2 charsets, ISO-8859-8 and WINDOWS-1255, but they are
nearly identical. One of the differences is that the generic currency
sign is replaced by the sheqel sign (Israel's currency) in Windows-1255.
And though the latter lost the "double low line", apparently some
Yiddish characters were added. Basically it looks like most Hebrew text
would work fine with the same confidence on both charsets, and detecting
both is likely irrelevant. So I keep the charset file for ISO-8859-8,
but won't actually use it.

The good part is now that Hebrew is also recognized in UTF-8 text thanks
to the new code and newly generated language model.
2021-03-17 23:22:50 +01:00
Jehan
ba6b46a68c src: make nsMBCSGroupProber report all valid candidates.
Returning only the best one has limits, as it doesn't allow checking
candidates with very close confidence. Now, in particular, the UTF-8
prober will return all ("UTF-8", lang) candidates for every language
with a probable statistical fit.
2021-03-17 16:38:20 +01:00
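
A hedged sketch of what "reporting all valid candidates" could look like
inside a group prober; the Candidate struct and function here are
illustrative assumptions, not the actual nsMBCSGroupProber code:

    #include <string>
    #include <vector>
    #include <algorithm>

    // Illustrative candidate type (the real code works on nsCharSetProber).
    struct Candidate
    {
      std::string charset;    // e.g. "UTF-8"
      std::string language;   // e.g. "ko", or empty when unknown
      float       confidence;
    };

    // Collect candidates from all sub-probers and sort them by confidence,
    // instead of returning only the single best candidate.
    static std::vector<Candidate>
    CollectCandidates(const std::vector<std::vector<Candidate>>& perProber)
    {
      std::vector<Candidate> all;
      for (const auto& candidates : perProber)
        all.insert(all.end(), candidates.begin(), candidates.end());
      std::sort(all.begin(), all.end(),
                [] (const Candidate& a, const Candidate& b)
                { return a.confidence > b.confidence; });
      return all;
    }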
Jehan
49ed0e6f45 src: allow for nsCharSetProber to return several candidates.
No functional change yet, because all probers still return 1 candidate.
Yet we now add a GetCandidates() method returning the number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
(charset, language) pair and that the confidence is for this specific
pair, not just the confidence for charset detection.
2021-03-17 13:29:13 +01:00
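
A sketch of how this candidate-index API could be consumed; the exact
C++ signatures are assumed from the method names above, not copied from
the source:

    #include <cstdio>

    // Hypothetical consumer: iterate over every (charset, language,
    // confidence) pair a prober computed.
    void PrintCandidates(nsCharSetProber* prober)
    {
      int count = prober->GetCandidates();
      for (int i = 0; i < count; i++)
        {
          const char* lang = prober->GetLanguage(i);   // may be NULL
          printf("%s / %s: %f\n",
                 prober->GetCharSetName(i),
                 lang ? lang : "(unknown)",
                 prober->GetConfidence(i));
        }
    }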
Jehan
41fc0f235b src: nsMBCSGroupProber confidence weighted by language confidence.
Since our whole charset detection logic is based on text having meaning
(using actual language statistics), just because a text is valid UTF-8
does not mean it is absolutely the right encoding. The text may also fit
other encodings, possibly with very high statistical confidence (and
therefore be a better candidate).
Therefore, instead of just returning 0.99 or other high values, let's
weight our encoding confidence by the best language confidence.
2021-03-17 13:09:10 +01:00
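
A simplified sketch of the weighting described here (illustrative names
and values, not the actual code):

    // Scale the flat UTF-8 confidence by the best language confidence
    // found by the language detectors (assumed to be in [0, 1]).
    float WeightedUtf8Confidence(float bestLangConfidence)
    {
      const float encodingConfidence = 0.99f;   // "this is valid UTF-8"
      return encodingConfidence * bestLangConfidence;
    }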
Jehan
f30c1cd8c8 src: reset language detectors when resetting an nsMBCSGroupProber. 2021-03-17 11:03:30 +01:00
Jehan
5c3a2e8037 src, script: regenerate all existing language models.
Now making sure that we have a generic language model working with UTF-8
for all 26 supported languages which had single-byte encoding support
until now.
2021-03-17 02:07:17 +01:00
Jehan
2a4d8d890e Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober not only detects valid UTF-8, but also detects the
most probable language. Using the data generated two commits away, this
works very well.

This is still basic and will require even more improvements. In
particular, nsUTF8Prober should now return an array of ("UTF-8",
language) candidate pairs. And nsMBCSGroupProber should itself forward
these candidates as well as the candidates from the other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also, the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weight it by language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid, and low enough to give
a chance to other good candidate pairs).
2021-03-16 18:37:09 +01:00
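
Extending the earlier weighting sketch with the fallback behaviour
suggested in this message; the floor value is an arbitrary illustration,
not something decided in this commit:

    // Hypothetical sketch: keep UTF-8 as a usable fallback even when no
    // language model matches, while letting confident (encoding, language)
    // pairs from other probers win.
    const float kUtf8FallbackConfidence = 0.2f;   // arbitrary, for illustration

    float WeightUtf8Confidence(float bestLangConfidence)
    {
      if (bestLangConfidence <= 0.0f)
        return kUtf8FallbackConfidence;           // no language detected
      return 0.99f * bestLangConfidence;          // weighted by language fit
    }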
Jehan
911695f682 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or for WINDOWS-1252. These will return
NULL. It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
project using this API could easily get the English language name from
the XML/JSON files provided by the iso-codes project. That project also
makes it easy to localize the language name into other languages through
gettext (this is what we do in GIMP, for instance). I don't add any
dependency though, and leave it to downstream projects to implement
this.

I was also wondering whether we want to support region information for
cases where it would make sense. I especially wondered about it for
Chinese encodings, as some of them seem quite specific to a region
(according to Wikipedia at least). For the time being though, these just
return "zh". We'll see later whether it makes sense to be more accurate
(maybe depending on reports?).
2021-03-14 00:12:30 +01:00
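
A minimal downstream sketch of turning the returned ISO 639-1 code into
an English name; the hardcoded table is a stand-in for the iso-codes
XML/JSON data (or a gettext lookup) and is not part of uchardet:

    #include <cstring>

    // Tiny stand-in for the iso-codes data: map an ISO 639-1 code (as
    // returned by the new language API) to an English language name.
    static const char* LanguageName(const char* iso639_1)
    {
      static const struct { const char* code; const char* name; } table[] = {
        { "ko", "Korean"  },
        { "he", "Hebrew"  },
        { "zh", "Chinese" },
        { "hi", "Hindi"   },
      };
      if (iso639_1)
        for (const auto& entry : table)
          if (std::strcmp(entry.code, iso639_1) == 0)
            return entry.name;
      return "(unknown)";
    }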
Jehan
dc371f3ba9 uchardet_get_charset() must return iconv-compatible names.
It was not clear whether our naming followed any particular rule. Since
iconv is a widely used encoding conversion API, we will follow its
naming.
At least one returned name was found to be invalid: x-euc-tw instead of
EUC-TW. Other names have been uppercased to follow the naming from
`iconv --list`, though iconv is mostly case-insensitive, so it should
not have been a problem. "Just in case".
Prober names can still be chosen freely (they are apparently only used
for display output).
Finally, HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding under this name in the libiconv master code, so I will
consider it valid.
2015-11-17 16:15:21 +01:00
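
A minimal sketch of why iconv-compatible names matter downstream: the
detected name can be passed straight to iconv_open(). Error handling is
trimmed for brevity, and the input buffer is assumed to be already
loaded:

    #include <uchardet/uchardet.h>
    #include <iconv.h>

    // Detect the charset of `data` and open an iconv conversion from it
    // to UTF-8, relying on uchardet_get_charset() returning
    // iconv-compatible names such as "EUC-TW".
    iconv_t OpenToUtf8(const char* data, size_t len)
    {
      uchardet_t ud = uchardet_new();
      uchardet_handle_data(ud, data, len);
      uchardet_data_end(ud);

      const char* charset = uchardet_get_charset(ud);   // e.g. "EUC-TW"
      iconv_t cd = iconv_open("UTF-8", charset);        // (iconv_t) -1 on error

      uchardet_delete(ud);
      return cd;
    }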
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00