uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-06-15 08:26:15 +08:00

Author	SHA1	Message	Date
Jehan	0be80a21db	script, src: update Norwegian model with the new language features. As I just rebased my branch about new language detection API, I needed to re-generate Norwegian language models. Unfortunately it doesn't detect UTF-8 Norwegian text, though not far off (it detects it as second candidate with high 91% confidence; beaten by Danish UTF-8 with 94% confidence unfortunately!). Note that I also update the alphabet list for Norwegian as there were too many letters in there (according to Wikipedia at least), so even when training a model, we had some missing characters in the training set.	2022-12-14 00:24:53 +01:00
Jehan	784f614c84	script: further fixing BuildLangModel.py.	2022-12-14 00:24:53 +01:00
Jehan	6365cad4fd	script: improve a bit the management of use_ascii option.	2022-12-14 00:24:53 +01:00
Jehan	81b83fffa9	script: work around recent issue of python wikipedia module. Adding `auto_suggest=False` to the wikipedia.page() call because this auto-suggest is completely broken, searching "mar ot" instead of "marmot" or "ground hug" instead of "Groundhog" (this one is extra funny but not so useful!). I actually wonder why it even needs to suggest anything when the Wikipedia pages do actually exist! Anyway the script BuildLangModel.py was very broken because of this, now it's better. See: https://github.com/goldsmith/Wikipedia/issues/295 Also printing the error message when we discard a page, which helps debugging.	2022-12-14 00:24:53 +01:00
Jehan	a3ff09bece	test: improve test error output even more. Adding the found confidence, but also the confidence matched by the expected (lang, charset) couple, and its candidate order, if it even matched.	2022-12-14 00:24:53 +01:00
Jehan	c9446e540d	test: add stderr logging when a test fails. It allows to get some more info in Testing/Temporary/LastTest.log to debug detection issues.	2022-12-14 00:24:53 +01:00
Jehan	bfa4b10d4d	script, src: add English language model. English detection is still quite crappy so I don't add a unit test yet. Though I believe the detection being bad is mostly because of too much shortcutting we are doing to go "fast". I should probably review this whole part of the logics as well.	2022-12-14 00:24:53 +01:00
Jehan	bed459c6e7	src: drop less of UTF-8 confidence even with few non-multibyte chars. Some languages are not meant to have multibyte characters. For instance, English would typically have none. Yet you can still have UTF-8 English text (with a few special characters, or foreign words…). So anyway let's make it less of a deal breaker. To be even fairer, the whole logics is biased of course and I believe that eventually we should get rid of these lines of code dropping confidence based on a number of characters. This is a ridiculous rule (we base our whole logics on language statistics and suddenly we add some weird rule with a completely random number). But for now, I'll keep this as-is until we make the whole library even more robust.	2022-12-14 00:24:53 +01:00
Jehan	bffb7819d2	test: fix test binary build for Windows. realpath() doesn't exist on Windows. Replace it with _fullpath() which does the same thing, as far as I can see (at least for creating an absolute path; it doesn't seem to canonicalize the path, or the docs don't say so, yet since we are controlling the arguments from our CMake script, it's not a big problem anyway). This fixed the CI build for Windows failing with: > undefined reference to `realpath'	2022-12-14 00:24:53 +01:00
Jehan	5cf3c648fb	src: reset shortcut charset/language on Reset(). Failing to do so, we always return the same language once we detected a shortcut one, even after resetting. For instance, the issue happened on the uchardet CLI tool.	2022-12-14 00:24:53 +01:00
Jehan	d6c5c26150	src: do not test with nsLatin1Prober anymore. Just commenting it out for now. This is just not good enough and could take over detection when other probers have low confidence (yet reasonable ones), returning an ugly WINDOWS-1252 with no language detection. I think we should even just get rid of it completely. For now, I temporarily comment it out and will see with further experiments.	2022-12-14 00:24:53 +01:00
Jehan	6436e1dd47	src: improve confidence computation (generic and single-byte charset). Nearly the same algorithm on both pieces of code now. I reintroduced the mTypicalPositiveRatio now that our models actually give the right ratio (not the "first 512" meaningless stuff anymore). Among the remaining differences, the last computation is the ratio of frequent characters over all characters. For the generic detector, we use the frequent+out sum instead. It works much better. I think that Unicode text is much more prone to have characters outside your expected range, while still being meaningful characters. Even control characters are much more meaningful in Unicode. So a ratio based on all characters would give much too low a confidence. Anyway this confidence algorithm is already better. We seem to approach much nicer confidence at each iteration, very satisfying!	2022-12-14 00:24:53 +01:00
Jehan	8e2cf7b81b	script: generate more complete frequent characters when range is set. The early version used to stop earlier, assuming frequent ranges were used only for language scripts with a lot of characters (such as Korean, or even more Japanese or Chinese), hence it was not efficient to keep data for them all. Since we now use a separate language detector for CJK, remaining scripts (so far) have a usable range of characters. Therefore it is much preferable to keep as much data as possible on these. This allowed redoing the Thai model (cf. previous commit) with more data, hence getting much better language confidence on Thai texts.	2022-12-14 00:24:53 +01:00
Jehan	314f062c70	script, src: regenerate the Thai model. With all the changes we made, regenerate the Thai model which was of poor quality. This new one is much better.	2022-12-14 00:24:53 +01:00
Jehan	41fec68674	src, script: fix the order of characters for Vietnamese. Cf. commit 872294d.	2022-12-14 00:24:53 +01:00
Jehan	338a51564a	src, script: add concept of alphabet_mapping in language models. This allows to handle cases where some characters are actually alternative/variants of another. For instance, a same word can be written with both variants, while both are considered correct and equivalent. Browsing a bit Slovenian Wikipedia, it looks like they only use them for titles there. I use this the first time on characters with diacritics in Slovene. Indeed these are so rarely used that they would hardly show in the stats and worse, any sequence using these in tested text would likely show as negative sequences hence drop the confidence in Slovenian. As a consequence, various Slovene text would show up as Slovak as it's close enough and contains the same character with diacritics in a common way.	2022-12-14 00:24:53 +01:00
Jehan	ba7d72e3b0	script: regenerate Slovak and Slovene with better alphabet support. I was missing some characters, especially in the Slovak alphabet. Oppositely the Slovene alphabet does not use 4 of the common ASCII alphabet.	2022-12-14 00:24:53 +01:00
Jehan	adb158b058	script: fix a stupid bug making same ratio for all frequent characters. Argh! How did I miss this!	2022-12-14 00:24:53 +01:00
Jehan	19737886fe	script, src: regenerate the Vietnamese model. The alphabet was not complete and thus confidence was a bit too low. For instance the VISCII test case's confidence bumped from 0.643401 to 0.696346 and the UTF-8 test case bumped from 0.863777 to 0.99. Only the Windows-1258 test case is slightly worse from 0.532846 to 0.532098. But the overall recognition gain is obvious anyway.	2022-12-14 00:24:53 +01:00
Jehan	9d29c3e26f	src: fix negative confidence wrapping around because of unsigned int. In the extreme case of more mCtrlChar than mTotalChar (since the latter does not include control characters), we end up with a negative value, which in unsigned int becomes a huge integer. So because the confidence was so bad that it would be negative, we ended up with a huge confidence. We had this case with our Japanese UTF-8 test file which ended up identified as French ISO-8859-1. So I just cast the uint to float early on in order to avoid such a pitfall. Now all our test cases succeed again, this time with full UTF-8+language support! Wouhou!	2022-12-14 00:24:53 +01:00
Jehan	b7acffc806	script, src: remove generated statistics data for Korean.	2022-12-14 00:24:53 +01:00
Jehan	b725c0b2ff	src: new nsCJKDetector specifically for Chinese/Japanese/Korean recognition. I was pondering improving the logics of the LanguageModel contents, in order to better handle languages with a huge number of characters (far too many to keep a full frequent list while keeping reasonable memory consumption and speed). But then I realized that this happens for languages which anyway have their own set of characters. For instance, modern Korean is near full hangul. Of course, we can find some Chinese characters here and there, but nothing which should really break confidence if we base it on the hangul ratio. Of course if some day we want to go further and detect older Korean, we will have to improve the logics a bit with some statistics, though I wonder if limiting ourselves to character frequency is not enough here (sequence frequency is maybe a bit overboard). To be tested. In any case, this new class gives much more relevant confidence on Korean texts, compared to the statistics data we previously generated. For Japanese, it is a mix of kana and Chinese characters. A modern full text cannot exist without a lot of kana (probably only old texts or very short texts, such as titles, could have only Chinese characters). We would still want to add a bit of statistics to correctly differentiate a Japanese text with a lot of Chinese characters in it from a Chinese text which quotes a few Japanese phrases. It will have to be improved, but for now it works fairly ok. A last case where we would want to play with statistics might be if we want to differentiate between regional variants. For instance, Simplified Chinese, Taiwan or Hong Kong Chinese… More to experiment later on. It's already a first good step for UTF-8 support with language!	2022-12-14 00:24:53 +01:00
Jehan	c782177a8d	README: fix a duplicate.	2022-12-14 00:24:53 +01:00
Jehan	3ca49e2bc1	Update README.	2022-12-14 00:24:50 +01:00
Jehan	8113f604de	src: consider any combination with a non-frequent character as sequence. Basically since we exclude non-letters (control chars, punctuation, spaces, separators, emoticons and whatnot), we consider any remaining character as an off-script letter (we may have forgotten some cases, but so far, it looks promising). Hence it is normal to consider a combination with these (i.e. 2 off-script letters or 1 frequent letter + 1 off-script, in any order) as a sequence too. Doing so will drop the confidence even more for any text having too many of these. As a consequence, it expands again the gap between the first and second contender, which seems to really show it works.	2022-12-14 00:23:13 +01:00
Jehan	a1b186fa8b	src: add Hindi/UTF-8 support.	2022-12-14 00:23:13 +01:00
Jehan	9736950227	src: improve confidence computation. Detect various blocks of characters for punctuation, symbols, emoticons and whatnot. These are considered kind of neutral in the confidence (because it's normal to have punctuation, and various texts nowadays are expected to display emoticons or various symbols). What is of interest is all the rest, which will then be considered as out-of-range characters (likely characters for other scripts) and will therefore drop the confidence. Now confidence will therefore take into account the ratio of all in-range characters (script letters + various neutral characters) and the ratio of frequent letters within all letters (script letters + out-of-range characters). This improved algorithm makes for much more efficient detection, as it bumped most confidence in all our unit tests, and usually increased the gap between the first and second contender.	2022-12-14 00:23:13 +01:00
Jehan	a98cdcd88f	script: fix a bit BuildLangModel.py when use_ascii is True. In particular, I prepare the case for English detection. I am not pushing actual English models yet, because it's not so efficient yet. I will do so when I am able to better handle English confidence.	2022-12-14 00:23:13 +01:00
Jehan	629bc879f3	script, src: add generic Korean model. Until now, Korean charsets had their own probers as there is no single-byte encoding for writing Korean. I now added a Korean model only for the generic character and sequence statistics. I also improved the generation script (script/BuildLangModel.py) to allow for languages without single-byte charset generation and to provide meaningful statistics even when the language script has a lot of characters (so we can't have a full sequence combination array, just too much data). It's not perfect yet. For instance our UTF-8 Korean test file ends up with confidence of 0.38503, which is low for obvious Korean text. Still it works (correctly detected, with top confidence compared to others) and is a first step toward more improvement for detection confidence.	2022-12-14 00:23:13 +01:00
Jehan	0d152ff430	src, test: fix the new Johab prober and add a test. This prober comes from MR !1 on the main branch though it was too aggressive then and could not get merged. On the improved API branch, it doesn't detect other tests as Johab anymore. Also fixing it to work with the new API. Finally adding a Johab/ko unit test.	2022-12-14 00:23:13 +01:00
Jehan	3996b9d648	src: build new charset prober for Johab Korean. CMake build was not completed and enum state nsSMState disappeared in commit 53f7ad0. Also fixing a few coding style bugs. See discussion in MR !1.	2022-12-14 00:23:13 +01:00
LSY	d72a5c88ce	add charset prober for Johab Korean	2022-12-14 00:23:13 +01:00
Jehan	ded948ce15	script, src: generate the Hebrew models. The Hebrew Model had never been regenerated by my scripts. I now added the base generation files. Note that I added 2 charsets: ISO-8859-8 and WINDOWS-1255 but they are nearly identical. One of the difference is that the generic currency sign is replaced by the sheqel sign (Israel currency) in Windows-1255. And though this one lost the "double low line", apparently some Yiddish characters were added. Basically it looks like most Hebrew text would work fine with the same confidence on both charsets and detecting both is likely irrelevant. So I keep the charset file for ISO-8859-8, but won't actually use it. The good part is now that Hebrew is also recognized in UTF-8 text thanks to the new code and newly generated language model.	2022-12-14 00:23:13 +01:00
Jehan	cf0ffb0c55	test: 4 new tests for UTF-8. Taken from random pages for each of these languages. I now have a test for each of the 26 supported (UTF-8, language) couples. These are all working fine and are detected with the right encoding and language.	2022-12-14 00:23:13 +01:00
Jehan	a7c5a167a9	src: drop the SURE_YES confidence for character distribution probers. Some probers are based on character distribution analysis. Though it is still relevant detection logics, we also know that it is a lot less subtle than sequence distribution. Therefore let's give a good confidence for a text passing such analysis, yet not a near perfect one, thus leaving some chance for other probers. In particular, we can definitely consider that if some text gets over 0.7 on sequence distribution analysis, this is a very likely candidate. I had the case with the Finnish UTF-8 test which was passing (UTF-8, Finnish) detection with a staggering 0.86 confidence, yet was overridden by UHC (EUC-KR). This used to not be a problem when nsMBCSGroupProber would check the UTF-8 prober first and stop there with just some basic encoding detection. Now that we go further and return all relevant candidates, simpler detection algorithms which always return too-good confidence are not the best idea.	2022-12-14 00:23:13 +01:00
Jehan	b00c85a6a6	src: do not shortcut UTF-8 detection too early. I had the case with the Czech test which was considered as Irish after being shortcut far too early after only 16 characters. Confidence values were just barely above 0.5 for Irish (and barely below for Czech). By adding a threshold (at least 256 characters), we give a bit of relevant data to the engine to actually make an informed decision. By then, the Czech detection was at more than 0.7, whereas the Irish one was at 0.6.	2022-12-14 00:23:13 +01:00
Jehan	2a16ab2310	src: nsEscCharsetProber also returns the correct language. nsEscCharsetProber will still only return a single candidate, because this is detected by a state machine, not language statistics anyway. Anyway now it will also return the language attached to the encoding.	2022-12-14 00:23:13 +01:00
Jehan	6138d9e0f0	src: make nsMBCSGroupProber report all valid candidates. Returning only the best one has limits, as it doesn't allow to check very close confidence candidates. Now in particular, the UTF-8 prober will return all ("UTF-8", lang) candidates for every language with probable statistical fit.	2022-12-14 00:23:13 +01:00
Jehan	2127f4fc0d	src: allow for nsCharSetProber to return several candidates. No functional change yet because all probers still return 1 candidate. Yet now we add a GetCandidates() method to return a number of candidates. GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter which is the candidate index (which must be below the return value of GetCandidates()). We can now consider that nsCharSetProber computes a couple (charset, language) and that the confidence is for this specific couple, not just the confidence for charset detection.	2022-12-14 00:23:13 +01:00
Jehan	ea32980273	src: nsMBCSGroupProber confidence weighed by language confidence. Since our whole charset detection logics is based on text having meaning (using actual language statistics), just because a text is valid UTF-8 does not mean it is absolutely the right encoding. It may also fit other encoding with maybe very high statistical confidence (and therefore a better candidate). Therefore instead of just returning 0.99 or other high values, let's weigh our encoding confidence with the best language confidence.	2022-12-14 00:23:13 +01:00
Jehan	25d2890676	src: tweak again the language detection confidence. Computing a logical number of sequences was a big mistake. In particular, a language with only positive sequences would have the same score as a language with a mix of only positive and probable sequences (i.e. 1.0). Instead, just use the real number of sequences, but probable sequences don't bring +1 to the numerator. Also drop the mTypicalPositiveRatio, at least for now. In my tests, it mostly made results worse. Maybe this would still make sense for languages with a huge number of characters (like CJK languages), for which we won't have the full list of characters in our "frequent" list of characters. Yet for most other languages, we actually list all the possible sequences within the character set, therefore any sequence out of our sequence list should necessarily drop confidence. Tweaking the result back up with some ratio is therefore counter-productive. As for CJK cases, we'll see how to handle the much higher number of sequences (too many to list them all) when we get there.	2022-12-14 00:23:13 +01:00
Jehan	1b5e68be00	test: update unit test to check detected languages. Excepting ASCII, UTF-16 and UTF-32 for which we don't detect languages yet.	2022-12-14 00:23:13 +01:00
Jehan	82c1d2b25e	src: reset language detectors when resetting a nsMBCSGroupProber.	2022-12-14 00:23:13 +01:00
Jehan	eb8308d50a	src, script: regenerate all existing language models. Now making sure that we have a generic language model working with UTF-8 for all 26 supported models which had single-byte encoding support until now.	2022-12-14 00:23:13 +01:00
Jehan	5257fc1abf	Using the generic language detector in UTF-8 detection. Now the UTF-8 prober would not only detect valid UTF-8, but would also detect the most probable language. Using the data generated 2 commits away, this works very well. This is still basic and will require even more improvements. In particular, now the nsUTF8Prober should return an array of ("UTF-8", language) couple candidates. And nsMBCSGroupProber should itself forward these candidates as well as other candidates from other multi-byte detectors. This way, the public-facing API would get more probable candidates, in case the algorithm is slightly wrong. Also the UTF-8 confidence is currently stupidly high as soon as we consider it to be right. We should likely weigh it with language detection (in particular, if no language is detected, this should severely weigh down UTF-8 detection; not to 0, but high enough to be a fallback in case no other encoding+lang is valid and low enough to give chances to other good candidate couples).	2022-12-14 00:23:13 +01:00
Jehan	dac7cbd30f	New generic language detector class. It detects languages similarly to the single byte encoding detector algorithm, based on character frequency and sequence frequency, except it does it generically from unicode codepoint, not caring at all about the original encoding. The confidence algorithm for language is very similar to the confidence algorithm for encoding+language in nsSBCharSetProber, though I tweaked it a little making it more trustworthy. And I plan to tweak it even a bit more later, as I improve progressively the detection logics with some of the idea I had.	2022-12-14 00:23:13 +01:00
Jehan	b70b1ebf88	Rebuild a bunch of language models. Adding generic language model (see coming commit), which uses the same data as specific single-byte encoding statistics model, except that it applies it to unicode code points. For this to work, instead of the CharToOrderMap which was mapping directly from encoded byte (always 256 values) to order, now we add an array of frequent characters, ordered by generic unicode code points to the order of frequency (which can be used on the same sequence mapping array). This of course means that each prober where we will want to use these generic models will have to implement their own byte to code point decoder, as this is per-encoding logics anyway. This will come in a subsequent commit.	2022-12-14 00:23:13 +01:00
Jehan	a0bfba3db3	src: add a --weight option to the CLI tool. Syntax is: lang1:weight1,lang2:weight2… For instance: `uchardet -wfr:1.1,it:1.05 file.txt` if you think a file is probably French or maybe Italian.	2022-12-14 00:23:13 +01:00
Jehan	669ede73a3	src: new weight concept in the C API. Pretty basic, you can weight preferred languages and this will impact the result. Say the algorithm "hesitates" between encoding E1 in language L1 and encoding E2 in language L2. By setting L2 with a 1.1 weight, for instance because this is the OS language, or usual preferred language, you may help the algorithm to overcome very tight cases. It can also be helpful when you already know for sure the language of a document, you just don't know its encoding. Then you may set a very high value for this language, or simply set a default value of 0, and set 1 for this language. Only relevant encodings will be taken into account. This is still limited though as generic encodings are still implemented language-agnostic. UTF-8 for instance would be disadvantaged by this weight system until we make it language-aware.	2022-12-14 00:23:13 +01:00
Jehan	f74d602449	src: fix the usage of `uchardet` tool. It was displaying -v for both verbose and version options. The new --verbose short option is actually -V (uppercase).	2022-12-14 00:23:13 +01:00
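
The `lang1:weight1,lang2:weight2…` syntax of commit a0bfba3db3 and the weighting idea of commit 669ede73a3 can be illustrated with a small Python sketch. Everything here is hypothetical: `parse_weights`, `pick_best`, the candidate tuples and their confidence values are made up for illustration, and multiplying the confidence by the weight is an assumption about how a weight might "impact the result", not uchardet's actual implementation.

```python
def parse_weights(spec):
    """Parse a string such as 'fr:1.1,it:1.05' into {'fr': 1.1, 'it': 1.05}."""
    weights = {}
    for item in spec.split(","):
        lang, sep, value = item.partition(":")
        if not sep or not lang.strip():
            raise ValueError(f"expected lang:weight, got {item!r}")
        weights[lang.strip()] = float(value)
    return weights


def pick_best(candidates, weights, default_weight=1.0):
    """Return the (charset, lang, confidence) tuple with the best weighted score.

    Assumption: the weight simply multiplies the confidence.
    """
    def score(candidate):
        _charset, lang, confidence = candidate
        return confidence * weights.get(lang, default_weight)

    return max(candidates, key=score)


# Made-up confidences for a "tight case" between two (charset, language) couples.
candidates = [
    ("ISO-8859-1", "fr", 0.80),
    ("ISO-8859-15", "it", 0.82),
]

weights = parse_weights("fr:1.1,it:1.05")
# Weighted scores: 0.80 * 1.1 = 0.88 for French, 0.82 * 1.05 = 0.861 for Italian.
best = pick_best(candidates, weights)
```

With no weights the Italian couple would win on raw confidence; with the weights above the French one does. Passing `default_weight=0.0` together with a weight of 1 for a single language mirrors the "default value of 0, and set 1 for this language" case described in the commit message.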


343 Commits