389 Commits

Jehan
06029ec334 src: allow setting a default language in the CLI tool.
The --weight syntax stays the same, with the addition that the
language '*' sets the default weight.

For instance, if you are sure that your input is either French or
English, you could run:

> uchardet -l -w 'fr:1,en:1,*:0'

(giving the same weight to French and English, and a weight of 0 to everything else)
2025-08-08 11:40:10 +02:00
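
As an illustration of the weight syntax above, here is a minimal parsing
sketch (a hypothetical helper, not the tool's actual code) showing how a
string like 'fr:1,en:1,*:0' can be split into per-language weights, with
'*' captured as the default weight:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    /* Hypothetical sketch: parse "fr:1,en:1,*:0" into per-language
     * weights, treating '*' as the default weight. */
    static void parse_weights (const std::string             &arg,
                               std::map<std::string, double> &weights,
                               double                        &default_weight)
    {
      std::istringstream stream (arg);
      std::string        item;

      while (std::getline (stream, item, ','))
        {
          const std::string::size_type colon = item.find (':');
          if (colon == std::string::npos)
            continue;  /* malformed entry: ignored in this sketch */

          const std::string lang   = item.substr (0, colon);
          const double      weight = std::stod (item.substr (colon + 1));

          if (lang == "*")
            default_weight = weight;  /* '*' sets the default weight */
          else
            weights[lang] = weight;
        }
    }

    int main ()
    {
      std::map<std::string, double> weights;
      double default_weight = 1.0;

      parse_weights ("fr:1,en:1,*:0", weights, default_weight);
      std::cout << "default weight: " << default_weight << "\n";  /* 0 */
      return 0;
    }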
Marcus Nilsson
9699dfce07 Issue #40: Close file when it's no longer needed 2025-06-07 23:35:44 +00:00
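
The pattern behind this fix, sketched for illustration (not the tool's
actual code): read the file contents into memory, then close the handle
as soon as the data has been read instead of keeping it open for the
rest of the run:

    #include <cstdio>
    #include <vector>

    std::vector<char> read_all (const char *path)
    {
      std::vector<char> data;
      FILE *f = std::fopen (path, "rb");
      if (!f)
        return data;

      char buf[4096];
      std::size_t n;
      while ((n = std::fread (buf, 1, sizeof buf, f)) > 0)
        data.insert (data.end (), buf, buf + n);

      std::fclose (f);  /* the file is no longer needed past this point */
      return data;
    }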
Gary Wang
dff8906402 fix: FTBFS under MSVC
1. __declspec(deprecated) is okay for MSVC
2. strcasecmp is POSIX-only; _stricmp should be used for MSVC

Co-authored-by: yyc12345 <yyc12321@outlook.com>
2025-06-07 23:24:48 +00:00
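
A minimal sketch of the two points above (the macro name here is
hypothetical):

    /* Hypothetical portability macros illustrating the fix. */
    #ifdef _MSC_VER
    #  define UCHARDET_DEPRECATED __declspec(deprecated)
    #  define strcasecmp _stricmp  /* strcasecmp is POSIX-only */
    #else
    #  define UCHARDET_DEPRECATED __attribute__((deprecated))
    #endif

    UCHARDET_DEPRECATED void old_api (void);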
Heiko Becker
6e163c978a CMake: Raise required version to 3.5
CMake >= 4.0.0-rc1 removed compatibility with versions < 3.5 and errors
out when such versions are passed to cmake_minimum_required(). 3.5.0 was
released 9 years ago, so I'd assume it's available almost everywhere.
2025-03-28 10:30:33 +01:00
Jehan
edae8e81cf gitlab-ci: CI is now forbidden on MRs opened by passing-by contributors.
So apparently Freedesktop CI won't run on non-official projects or the
GitLab namespaces of unknown developers. In particular, this makes CI
fail on merge requests from such passing-by contributors!

Adding these small rules is supposed to allow such jobs to run anyway.

See: https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/540
2023-11-15 17:09:19 +01:00
Jaroslav Lobačevski
b95252ff0c Add notepad++ to readme 2023-11-15 13:39:38 +00:00
Jehan
ab1d2f1120 src: handle long sequences of characters.
Actually my previous commit did not handle all cases, though it took
care of the buffer overflow triggered by the provided byte sequence. Yet
I believe it was still possible to craft special input sequences too
long for codePointBuffer.
This additional commit handles these other cases by processing the input
in manageable sub-strings.
2023-07-17 20:09:10 +02:00
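
The sub-string processing described above, sketched with hypothetical
names:

    #include <algorithm>
    #include <cstddef>

    static const std::size_t kMaxChunk = 1024;  /* must fit in codePointBuffer */

    void process_sub_string (const char *data, std::size_t len);  /* hypothetical helper */

    /* Consume arbitrarily long input in bounded chunks so that no single
     * pass can overflow codePointBuffer. */
    void handle_data (const char *data, std::size_t len)
    {
      for (std::size_t offset = 0; offset < len; offset += kMaxChunk)
        process_sub_string (data + offset, std::min (kMaxChunk, len - offset));
    }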
Jehan
9910941387 Issue #33: crafted sequence of bytes triggers memory write past the bounds of…
… a heap allocated buffer.

Before starting to process a multi-byte sequence, we should make sure
that our buffer is not nearly full with single-byte data. If so, process
said data first.
2023-07-17 18:46:35 +02:00
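
The guard described above, sketched with hypothetical names:

    #include <cstring>

    extern char        buffer[];     /* hypothetical code point buffer */
    extern std::size_t buffer_size;
    extern std::size_t buffer_used;

    void flush_pending_data ();  /* hypothetical: processes data, resets buffer_used */

    /* Flush pending single-byte data first whenever a multi-byte sequence
     * would no longer fit in the buffer. */
    void append_sequence (const char *seq, std::size_t seq_len)
    {
      if (buffer_used + seq_len > buffer_size)
        flush_pending_data ();

      std::memcpy (buffer + buffer_used, seq, seq_len);
      buffer_used += seq_len;
    }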
Jehan
8fe0b2e080 src: fix mismatched new [] / delete.
This fixes the following bug reported by ASAN:

> ==42862==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x619000000080
>     #0 0x7f1dc1fa2017 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160
>     #1 0x7f1dc1e8b132 in nsSBCSGroupProber::~nsSBCSGroupProber() /home/jehan/dev/src/uchardet/src/nsSBCSGroupProber.cpp:257
2023-07-17 16:44:39 +02:00
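
The bug class in a nutshell (illustrative, not the project's code):

    #include <string>

    int main ()
    {
      std::string *probers = new std::string[3];  /* allocated with new [] */

      /* delete probers; */   /* WRONG: the mismatch ASAN reports above */
      delete [] probers;      /* RIGHT: new [] pairs with delete [] */
      return 0;
    }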
Jehan
bc93da89d9 Issue #32: Global buffer read overflow in GetOrderFromCodePoint. 2023-07-17 16:39:52 +02:00
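
Such a fix typically amounts to a bounds check before the table lookup;
a sketch with hypothetical names:

    /* Hypothetical guarded lookup; table name and size are illustrative. */
    extern const int          kOrderTable[];
    extern const unsigned int kTableSize;

    int GetOrderFromCodePoint (unsigned int code_point)
    {
      if (code_point >= kTableSize)
        return -1;  /* out of range: not a tracked character */

      return kOrderTable[code_point];
    }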
Jehan
bd983ca108 CMake: enable ASAN in Debug builds. 2023-07-17 16:30:10 +02:00
Jehan
bdd71d88f8 script: improve a bit create-table.py and regenerate the Georgian charsets.
- Avoid trailing whitespaces.
- Print which tool and version were used for the generation (to help for
  future debugging in case of discrepancies between versions or
  implementations).
2022-12-20 14:38:51 +01:00
Jehan
7875272a8c script, src, test: new Georgian support.
For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-*
sets were generated thanks to the new create-table.py script.

Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.
2022-12-20 14:28:29 +01:00
Jehan
c843d23a17 script: new create-table script.
I wanted to add new tables, GEORGIAN-ACADEMY and GEORGIAN-PS, for which
I could find no listing anywhere, even though iconv supports them (core
Python does not). I could find info on these in the libiconv source
(./lib/georgian_academy.h and ./lib/georgian_ps.h), though rather than
trying to read these, I thought I should just do it the other way
around: get a table back from the return value of the iconv API (or
Python decode() when relevant).

So this script is able to generate tables in the format used under
script/charsets/, from either Python decode() or iconv. It will be very
useful!
2022-12-20 12:03:19 +01:00
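
The round-trip idea, sketched in C++ with the iconv C API (the script
itself does this in Python, so this is only an illustration; a
little-endian host is assumed for the UTF-32LE read-back):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <iconv.h>

    /* Ask iconv what each byte 0x00..0xFF decodes to, and print the
     * resulting table. */
    int main ()
    {
      iconv_t cd = iconv_open ("UTF-32LE", "GEORGIAN-PS");
      if (cd == (iconv_t) -1)
        return 1;

      for (unsigned int b = 0; b < 256; b++)
        {
          char          in[1]  = { (char) b };
          std::uint32_t out    = 0;
          char         *inbuf  = in;
          char         *outbuf = (char *) &out;
          std::size_t   inleft = 1, outleft = sizeof out;

          if (iconv (cd, &inbuf, &inleft, &outbuf, &outleft) == (std::size_t) -1)
            std::printf ("0x%02X\tILLEGAL\n", b);  /* iconv cannot map this byte */
          else
            std::printf ("0x%02X\tU+%04X\n", b, (unsigned) out);

          iconv (cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
        }

      iconv_close (cd);
      return 0;
    }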
Jehan
419a971e6a script: update the README. 2022-12-20 01:56:24 +01:00
Jehan
d40e5868d5 script, src, test: adding Catalan support.
For UTF-8, ISO-8859-1 and WINDOWS-1252 support.

The tests for UTF-8 and ISO-8859-1 are taken from the 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. Since ISO-8859-1 and WINDOWS-1252 are very similar
for most letters (in particular the ones used in Catalan), I
differentiated the tests with a text containing the '€' symbol, which
sits on an unused spot in ISO-8859-1.
2022-12-20 01:46:15 +01:00
Jehan
cec8817d79 src: new Big5 detection implementation.
Rather than using a huge frequency table through some state machine
code that I don't even understand, I noticed that the Big5 encoding is
organized from the start into frequent and non-frequent character tables
(per the Wikipedia page on Big5). This makes it very easy to score text
by just counting which class each character is in.

Making a few tests with random Chinese text converted to Big5, it seems
to work pretty well (and fixes the test which got broken by the previous
commit), and it doesn't slow down detection in any significant way
either.

This may be the next step towards also improving the various multi-byte
encoding detections, which still use some generated coding state
machines that mostly still elude me.
2022-12-19 00:01:12 +01:00
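
The counting idea, sketched (byte ranges per the Big5 layout described
on Wikipedia: lead bytes 0xA4-0xC6 for the frequently-used block,
0xC9-0xF9 for the less-frequent one; names and scoring are illustrative,
not the project's actual code):

    #include <cstddef>

    /* Illustrative: score Big5 text by the share of characters falling in
     * the frequently-used block, classified from the lead byte alone. */
    float big5_confidence (const unsigned char *data, std::size_t len)
    {
      std::size_t frequent = 0, total = 0;

      for (std::size_t i = 0; i + 1 < len; i++)
        {
          unsigned char lead = data[i];
          if (lead < 0x81)
            continue;  /* ASCII or invalid: not a Big5 lead byte */

          if (lead >= 0xA4 && lead <= 0xC6)
            frequent++;  /* frequently-used characters block */
          total++;
          i++;  /* skip the trail byte */
        }

      return total ? (float) frequent / (float) total : 0.0f;
    }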
Jehan
0fe51d3851 Issue #21: Greek CP737 support.
It actually breaks "zh:big5", so I'm going to hold off a bit. Adding
more language and charset support is slowly starting to show the
limitations of our legacy multi-byte charset support, since I haven't
really touched it since the original Mozilla implementation.

It might be time to start reviewing these parts of the code.

The test file contents come from the 'Μαρμότα' page on Wikipedia in
Greek (though since 2 letters are missing from this encoding, despite
its popularity for Greek, I had to be careful to choose pieces of text
without such letters).
2022-12-18 22:33:12 +01:00
Jehan
a82139b3bd script: fix a notice message.
Probably broken in commit db836fa (I replaced a bunch of print() calls
with sys.stderr.write()).
2022-12-18 22:24:55 +01:00
Jehan
d4ef245fdc script: add a requirements.txt for our generation script.
It will make it easier to follow any dependency change, as it is kind of
a standard file in Python projects. Of course, it's not a dependency
list for uchardet itself, only for the generation script (so for
developers only), which is why I put it inside the script/ folder.
2022-12-18 17:27:38 +01:00
Jehan
db836fad63 script, src: generate more code for language and sequence model listing.
Right now, each time we add new language or charset support, there are
too many pieces of code we must not forget to edit. The script
script/BuildLangModel.py will now take care of the main parts: listing
the sequence models, listing the generic language models and computing
the numbers for each listing.

Furthermore the script now ends with a TODO list of the parts which
still have to be done manually (2 functions to edit and a CMakeLists).

Finally the script now allows giving a list of languages rather than
having to run it with languages one by one. It also allows 2 special
codes: "none", which retrains none of the languages but only
re-generates the generated listings; and "all", which retrains all
models (useful in particular when we change the model formats or usage
and want to regenerate everything).
2022-12-18 17:23:34 +01:00
Jehan
d6cab28fb4 README: UTF-8 support was missing from the listing for several languages. 2022-12-17 23:00:26 +01:00
Jehan
abd123e07d script, src, test: add Serbian support.
For UTF-8, ISO-8859-5 and WINDOWS-1251.

Test files' contents come from page 'Мрмот' on Wikipedia in Serbian.
2022-12-17 22:47:54 +01:00
Jehan
d00d4d52b7 src, script: add Macedonian support.
For UTF-8, ISO-8859-5, WINDOWS-1251 and IBM855 encodings.

Test files' contents come from page 'Хибернација' on Wikipedia in
Macedonian.
2022-12-17 22:47:54 +01:00
Jehan
41d309e8a2 script, src: regenerate Russian models and add UTF-8/Russian support.
This fixes the broken Russian test in Windows-1251 which once again gets
a much better score with Russian. Also this adds UTF-8 support.

Same as Bulgarian, I wonder why I had not regenerated this earlier.

The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.

Note that this now broke the zh:gb18030 test (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to take a
closer look at our dedicated GB18030 prober.
2022-12-17 21:41:11 +01:00
Jehan
60dcec8a82 script, src, test: add Ukrainian support.
UTF-8 and Windows-1251 support for now.

This actually breaks the ru:windows-1251 test but, same as Bulgarian, I
never generated the Russian models with my scripts, so the models we
currently use are quite outdated. It will obviously be a lot better once
we have new Russian models.

The test file contents come from the 'Бабак' page on Wikipedia in
Ukrainian.
2022-12-17 21:40:56 +01:00
Jehan
0fffc109b5 script, src, test: adding Belarusian support.
Support for UTF-8, Windows-1251 and ISO-8859-5.
The test contents come from the 'Суркі' page on Wikipedia in Belarusian.
2022-12-17 19:13:03 +01:00
Jehan
ffb94e4a9d script, src, test: Bulgarian language models added.
Not sure why we had Bulgarian support but never really updated it
(i.e. never with the model generation script, or so it seems),
especially with generic language models, which allow having
UTF-8/Bulgarian support. Maybe I tested it some time ago and it was
getting bad results? Anyway, with all the recent updates to the
confidence computation, I now get very good detection scores.

So I'm adding support for UTF-8/Bulgarian and rebuilding the other
models too.

Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).

The 2 new test files are text from the 'Мармоти' page on Wikipedia in
Bulgarian.
2022-12-17 18:41:00 +01:00
Jehan
5e25e93da7 script: add error handling for when iconv fails to convert from a codepoint.
It can happen when our character set table is wrong, but it can also
happen when iconv has a bug with incomplete charset tables. For
instance, I was trying to implement IBM880 for #29, but iconv was
missing a few codepoints: it seems to think that 0x45 (є), 0x55 (ў) and
0x74 (Ў) are meant to be illegal in IBM880 (and possibly others), but
the information we have seems to say they are valid.
And Python does not support this character set at all.

This check will help discover the issue earlier (rather than breaking a
few lines later because `iconv` failed and returned an empty string,
making ord() fail with a TypeError exception).

See: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/29#note_1691847
2022-12-17 18:00:22 +01:00
Jehan
6d31689632 test: adding 2 tests for Hebrew/IBM862 recognition.
This is the same text, taken from this Wikipedia page, which was today's
page of honor on Wikipedia in Hebrew:
https://he.wikipedia.org/wiki/שתי מסכתות על ממשל מדיני

I put it in 2 variants, since IBM862 can be used in logical and visual
order. The visual variant is just about reversing the order of letters
(per line, while lines stay in proper order), so that's what I did.
Though note that the English title quoted in the text should likely not
have been reversed; it doesn't matter too much, since those letters are
outside the Hebrew alphabet anyway and would trigger bad sequence scores
whatever their order. So I didn't bother fixing these.
2022-12-16 23:35:17 +01:00
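
The visual-order transformation described above, sketched for
illustration (valid for a single-byte encoding like IBM862, where one
byte is one character):

    #include <algorithm>
    #include <iostream>
    #include <string>

    /* Reverse the characters of each line while keeping line order,
     * turning logical-order Hebrew into visual order. */
    int main ()
    {
      std::string line;
      while (std::getline (std::cin, line))
        {
          std::reverse (line.begin (), line.end ());
          std::cout << line << '\n';
        }
      return 0;
    }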
Jehan
0974920bdd Issue #22: Hebrew CP862 support.
Added in both visual and logical order since Wikipedia says:

> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.

I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure it is of any use. Our statistical
code based on letter and sequence usage should be more than enough to
detect both variants of Hebrew encoding already, and my testing shows
that so far (with pretty outstanding scores on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later, and maybe the whole nsHebrewProber might be deleted,
even for the Windows-1255 charset.

I'm also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a
bit by incrementing a single index, instead of maintaining the indexes
by hand (otherwise, each time we add probers in the middle to keep them
logically gathered by language, we have to manually increment dozens of
following probers).
2022-12-16 23:27:52 +01:00
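
The cleanup from the last paragraph, in a nutshell (strings stand in for
the real prober objects):

    #include <string>
    #include <vector>

    int main ()
    {
      std::vector<std::string> probers (8);
      unsigned int i = 0;

      /* Before: hand-maintained slots (probers[12] = ..., probers[13] = ...),
       * fragile whenever a prober is inserted in the middle. After: */
      probers[i++] = "Windows-1255 Hebrew";
      probers[i++] = "IBM862 Hebrew (logical)";
      probers[i++] = "IBM862 Hebrew (visual)";
      return 0;
    }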
Jehan
127d7faf47 test: add ability to have several tests per charsets.
While the expected charset name is still the first part of the test
file name (until the first dot character), the test name is everything
but the last part (until the last dot character). This allows having
several test files for a single charset.

In particular, I want at least 2 test files for Hebrew, since it has a
visual and a logical variant. So I can call these "ibm862.visual.txt"
and "ibm862.logical.txt", which both expect IBM862 as the resulting
charset, but the test names will be "he:ibm862.visual" and
"he:ibm862.logical" respectively. Without this change, the test names
would collide and CMake would refuse them.
2022-12-16 23:10:34 +01:00
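
The naming rule, sketched: the expected charset is everything before the
first dot, the test name everything before the last dot:

    #include <iostream>
    #include <string>

    int main ()
    {
      const std::string file    = "ibm862.visual.txt";
      const std::string charset = file.substr (0, file.find ('.'));   /* "ibm862" */
      const std::string test    = file.substr (0, file.rfind ('.'));  /* "ibm862.visual" */

      std::cout << "expected charset: " << charset << '\n'
                << "test name:        he:" << test << '\n';
      return 0;
    }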
Jehan
3a6806ab19 test: no:utf-8 is actually working now, after the last model script fix…
… and rebuild of models.

The scores are really not bad now: 0.896026 for Norwegian and 0.877947
for Danish. It looks like the last confidence computation changes I made
are really bearing fruit!
2022-12-15 15:11:17 +01:00
Jehan
e6e51d9fe8 src: all language models now rebuilt after the fix. 2022-12-15 14:31:55 +01:00
Jehan
362086bf56 script: fix BuildLangModel.py. 2022-12-15 14:31:10 +01:00
Jehan
598fe90c91 test: finally add English/UTF-8 test file.
I had this test file locally for some time now, but it was always
failing, recognized as other languages, until now. Thanks to the recent
confidence improvements with the new frequent/rare ratios, it is finally
detected as English by uchardet!
2022-12-14 21:45:29 +01:00
Jehan
6bb1b3e101 scripts: all language models rebuilt with the new ratio data. 2022-12-14 20:16:44 +01:00
Jehan
e311b64cd9 script: model-building script updated to produce the 2 new ratios…
… introduced in previous commit.
2022-12-14 20:15:34 +01:00
Jehan
401eb55dfc src: improve algorithm for confidence computation.
In addition to the "frequent characters" concept, we add 2
sub-categories: the "very frequent characters" and the "rare
characters". The former are usually just a few characters which are used
most of the time (like 3 or 4 characters used 40% of the time!), whereas
the latter are often a dozen or more characters which, all together, are
barely used a few percent of the time.

We use this additional concept to help distinguish very similar
languages, or languages whose frequent characters are a subset of the
ones from another language (typically English, whose alphabet is a
subset of many other European languages' alphabets).

We get rid of mTypicalPositiveRatio, as it was of barely any use anyway
(it was 0.99-something for nearly all languages!). Instead we get these
2 new ratios, veryFreqRatio and lowFreqRatio, and of course the
associated order counts to know which characters are in these sets.
2022-12-14 20:02:59 +01:00
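
To make the idea concrete, a heavily simplified sketch of how such
ratios could feed a confidence score; this is not the project's actual
formula, and the names and weights are made up:

    #include <algorithm>
    #include <cmath>

    struct LangModel
    {
      float veryFreqRatio;  /* e.g. ~0.4: text share of the top 3-4 characters */
      float lowFreqRatio;   /* e.g. ~0.03: text share of the rare characters */
    };

    /* Penalize the base confidence when the observed ratios diverge from
     * the ones recorded in the language model. */
    float adjust_confidence (float base, float observed_very_freq,
                             float observed_low_freq, const LangModel &m)
    {
      float penalty = 1.0f
        - 0.5f * std::fabs (observed_very_freq - m.veryFreqRatio)
        - 0.5f * std::fabs (observed_low_freq  - m.lowFreqRatio);

      return base * std::max (penalty, 0.0f);
    }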
Jehan
4f35cd4416 src: when checking for candidates, make sure we haven't any unprocessed…
… language data left.
2022-12-14 08:39:49 +01:00
Jehan
7f386d922e script, src: rebuild the English model.
The previous model was most obviously wrong: all letters had the same
probability, even non-ASCII ones! Anyway, this new model makes the unit
tests a tiny bit better, though English detection is still weak (I have
more concepts which I want to experiment with to improve this).
2022-12-14 00:36:02 +01:00
Jehan
fb433a57b5 src: add a --language|-l option to the uchardet CLI tool. 2022-12-14 00:24:53 +01:00
Jehan
908f9b8ba7 src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/.
This was badly named, as this function does not return candidates but
the number of candidates (meant to be used with other API calls).
2022-12-14 00:24:53 +01:00
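
For context, a sketch of how the renamed function sits in the C API;
only uchardet_get_n_candidates is named by this commit, so the return
type and the commented-out per-candidate accessor are assumptions:

    #include <stdio.h>
    #include <uchardet/uchardet.h>

    int main (void)
    {
      uchardet_t ud = uchardet_new ();
      const char text[] = "Du texte en français, par exemple.";

      uchardet_handle_data (ud, text, sizeof text - 1);
      uchardet_data_end (ud);

      /* Returns how many candidates were detected; it does not return the
       * candidates themselves, hence the rename. */
      size_t n = uchardet_get_n_candidates (ud);
      printf ("%u candidates\n", (unsigned) n);

      /* Hypothetical per-candidate accessor, for illustration only:
       * for (size_t i = 0; i < n; i++)
       *   printf ("%s\n", uchardet_get_charset_n (ud, i));
       */

      uchardet_delete (ud);
      return 0;
    }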
Jehan
a916fb1c56 test: temporarily disable the Norwegian/UTF-8 test.
It is currently recognized as Danish/UTF-8 with a 0.958 score, though
Norwegian/UTF-8 is indeed the second candidate with 0.911 (the third
candidate is far behind: Swedish/UTF-8 with 0.815). Before wasting time
tweaking models, there are more basic conceptual changes that I want to
implement first (they might be enough to change the results!). So let's
skip this test for now.
2022-12-14 00:24:53 +01:00
Jehan
baeefc0958 src: process pending language data when we are going to pass buffer size.
We were experiencing segmentation faults when processing long texts
because we ended up trying to access out-of-range data (from
codePointBuffer). Verify when this is about to happen and process the
pending data to reset the index before adding more code points.
2022-12-14 00:24:53 +01:00
Jehan
b5b75b81ce script, src: rebuild the Danish model.
Now that it has IBM865 support on the main branch and that I rebased,
this feature branch for the new API got broken too.
2022-12-14 00:24:53 +01:00
Jehan
0be80a21db script, src: update Norwegian model with the new language features.
As I just rebased my branch for the new language detection API, I needed
to re-generate the Norwegian language models. Unfortunately it doesn't
detect UTF-8 Norwegian text, though it's not far off (it is detected as
the second candidate with a high 91% confidence, unfortunately beaten by
Danish UTF-8 with 94% confidence!).

Note that I also updated the alphabet list for Norwegian, as there were
too many letters in there (according to Wikipedia at least), so even
when training a model, we had some missing characters in the training
set.
2022-12-14 00:24:53 +01:00
Jehan
784f614c84 script: further fixing BuildLangModel.py. 2022-12-14 00:24:53 +01:00
Jehan
6365cad4fd script: improve a bit the management of use_ascii option. 2022-12-14 00:24:53 +01:00
Jehan
81b83fffa9 script: work around recent issue of python wikipedia module.
Adding `auto_suggest=False` to the wikipedia.page() call because this
auto-suggest is completely broken, searching "mar ot" instead of
"marmot" or "ground hug" instead of "Groundhog" (this one is extra funny
but not so useful!). I actually wonder why it even needs to suggest
anything when the Wikipedia pages do actually exist! Anyway, the script
BuildLangModel.py was very broken because of this; now it's better.

See: https://github.com/goldsmith/Wikipedia/issues/295

Also printing the error message when we discard a page, which helps
debugging.
2022-12-14 00:24:53 +01:00