uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-07 10:19:59 +08:00

Author	SHA1	Message	Date
Jehan	2a16ab2310	src: nsEscCharsetProber also returns the correct language. nsEscCharsetProber will still only return a single candidate, because this is detected by a state machine, not language statistics anyway. Anyway now it will also return the language attached to the encoding.	2022-12-14 00:23:13 +01:00
Jehan	6138d9e0f0	src: make nsMBCSGroupProber report all valid candidates. Returning only the best one has limits, as it doesn't allow to check very close confidence candidates. Now in particular, the UTF-8 prober will return all ("UTF-8", lang) candidates for every language with probable statistical fit.	2022-12-14 00:23:13 +01:00
Jehan	2127f4fc0d	src: allow for nsCharSetProber to return several candidates. No functional change yet because all probers still return 1 candidate. Yet now we add a GetCandidates() method to return a number of candidates. GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter which is the candidate index (which must be below the return value of GetCandidates()). We can now consider that nsCharSetProber computes a couple (charset, language) and that the confidence is for this specific couple, not just the confidence for charset detection.	2022-12-14 00:23:13 +01:00
Jehan	ea32980273	src: nsMBCSGroupProber confidence weighed by language confidence. Since our whole charset detection logics is based on text having meaning (using actual language statistics), just because a text is valid UTF-8 does not mean it is absolutely the right encoding. It may also fit other encoding with maybe very high statistical confidence (and therefore a better candidate). Therefore instead of just returning 0.99 or other high values, let's weigh our encoding confidence with the best language confidence.	2022-12-14 00:23:13 +01:00
Jehan	25d2890676	src: tweak again the language detection confidence. Computing a logical number of sequences was a big mistake. In particular, a language with only positive sequences would have the same score as a language with a mix of only positive and probable sequences (i.e. 1.0). Instead, just use the real number of sequences, but probable sequences don't bring +1 to the numerator. Also drop the mTypicalPositiveRatio, at least for now. In my tests, it mostly made results worse. Maybe this would still make sense for languages with a huge number of characters (like CJK languages), for which we won't have the full list of characters in our "frequent" list of characters. Yet for most other languages, we actually list all the possible sequences within the character set, therefore any sequence out of our sequence list should necessarily drop confidence. Tweaking the result back up with some ratio is therefore counter-productive. As for CJK cases, we'll see how to handle the much higher number of sequences (too many to list them all) when we get there.	2022-12-14 00:23:13 +01:00
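The scoring idea in the message above can be illustrated with a minimal sketch. It is not the code from the commit: the function name, the three counters and the `probable_credit` parameter are hypothetical, and the actual credit given to probable sequences is not stated in the message.

```python
def language_confidence(positive, probable, other, probable_credit=0.5):
    """Toy confidence: share of observed sequences that fit the language.

    positive, probable, other are counts of observed character sequences.
    probable_credit is an assumed partial credit (below 1) for probable
    sequences; the commit only says they do not add a full +1.
    """
    total = positive + probable + other  # the real number of sequences
    if total == 0:
        return 0.0
    return (positive + probable_credit * probable) / total


# Only positive sequences scores higher than a positive/probable mix.
print(language_confidence(10, 0, 0))
print(language_confidence(5, 5, 0))
```

With this shape, any sequence outside the model (counted in `other`) lowers the score, which matches the reasoning given for dropping the corrective ratio.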
Jehan	1b5e68be00	test: update unit test to check detected languages. Excepting ASCII, UTF-16 and UTF-32 for which we don't detect languages yet.	2022-12-14 00:23:13 +01:00
Jehan	82c1d2b25e	src: reset language detectors when resetting a nsMBCSGroupProber.	2022-12-14 00:23:13 +01:00
Jehan	eb8308d50a	src, script: regenerate all existing language models. Now making sure that we have a generic language model working with UTF-8 for all 26 supported models which had single-byte encoding support until now.	2022-12-14 00:23:13 +01:00
Jehan	5257fc1abf	Using the generic language detector in UTF-8 detection. Now the UTF-8 prober would not only detect valid UTF-8, but would also detect the most probable language. Using the data generated 2 commits away, this works very well. This is still basic and will require even more improvements. In particular, now the nsUTF8Prober should return an array of ("UTF-8", language) couple candidates. And nsMBCSGroupProber should itself forward these candidates as well as other candidates from other multi-byte detectors. This way, the public-facing API would get more probable candidates, in case the algorithm is slightly wrong. Also the UTF-8 confidence is currently stupidly high as soon as we consider it to be right. We should likely weigh it with language detection (in particular, if no language is detected, this should severely weigh down UTF-8 detection; not to 0, but high enough to be a fallback in case no other encoding+lang is valid and low enough to give chances to other good candidate couples).	2022-12-14 00:23:13 +01:00
Jehan	dac7cbd30f	New generic language detector class. It detects languages similarly to the single byte encoding detector algorithm, based on character frequency and sequence frequency, except it does it generically from unicode codepoint, not caring at all about the original encoding. The confidence algorithm for language is very similar to the confidence algorithm for encoding+language in nsSBCharSetProber, though I tweaked it a little making it more trustworthy. And I plan to tweak it even a bit more later, as I improve progressively the detection logics with some of the idea I had.	2022-12-14 00:23:13 +01:00
Jehan	b70b1ebf88	Rebuild a bunch of language models. Adding generic language model (see coming commit), which uses the same data as specific single-byte encoding statistics model, except that it applies it to unicode code points. For this to work, instead of the CharToOrderMap which was mapping directly from encoded byte (always 256 values) to order, now we add an array of frequent characters, ordered by generic unicode code points to the order of frequency (which can be used on the same sequence mapping array). This of course means that each prober where we will want to use these generic models will have to implement their own byte to code point decoder, as this is per-encoding logics anyway. This will come in a subsequent commit.	2022-12-14 00:23:13 +01:00
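The lookup described above (frequent characters keyed by Unicode code point instead of a 256-entry byte table) can be sketched as follows. The table contents, names and the use of binary search are assumptions made for illustration; they are not taken from the generated model files.

```python
from bisect import bisect_left

# Hypothetical toy model: (code point, frequency order) pairs,
# kept sorted by code point so a binary search can find them.
UNICODE_CHAR_TO_ORDER = [
    (ord("a"), 1),
    (ord("e"), 0),
    (ord("t"), 2),
    (ord("é"), 3),
]
CODE_POINTS = [cp for cp, _ in UNICODE_CHAR_TO_ORDER]


def char_order(code_point):
    """Return the frequency order of a code point, or None if not frequent."""
    i = bisect_left(CODE_POINTS, code_point)
    if i < len(CODE_POINTS) and CODE_POINTS[i] == code_point:
        return UNICODE_CHAR_TO_ORDER[i][1]
    return None


def orders_for_bytes(data, encoding="utf-8"):
    """Decode bytes to code points first (the per-encoding step), then map."""
    return [char_order(ord(ch)) for ch in data.decode(encoding)]


print(orders_for_bytes("été".encode("utf-8")))
```

The decoding step is the part each prober would have to supply itself; the order values can then index the same sequence table regardless of the original encoding.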
Jehan	a0bfba3db3	src: add a --weight option to the CLI tool. Syntax is: lang1:weight1,lang2:weight2… For instance: `uchardet -wfr:1.1,it:1.05 file.txt` if you think a file is probably French or maybe Italian.	2022-12-14 00:23:13 +01:00
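As an illustration of the `lang1:weight1,lang2:weight2` syntax, here is a small parser sketch. It is not the CLI tool's own parsing code, and the function name `parse_weights` is hypothetical.

```python
def parse_weights(spec):
    """Parse 'fr:1.1,it:1.05' into {'fr': 1.1, 'it': 1.05}."""
    weights = {}
    for item in spec.split(","):
        lang, sep, weight = item.partition(":")
        if not sep or not lang:
            raise ValueError(f"expected lang:weight, got {item!r}")
        weights[lang] = float(weight)  # raises ValueError on a bad number
    return weights


print(parse_weights("fr:1.1,it:1.05"))
```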
Jehan	669ede73a3	src: new weight concept in the C API. Pretty basic, you can weight a preferred language and this will impact the result. Say the algorithm "hesitates" between encoding E1 in language L1 and encoding E2 in language L2. By setting L2 with a 1.1 weight, for instance because this is the OS language, or usual preferred language, you may help the algorithm to overcome very tight cases. It can also be helpful when you already know for sure the language of a document, you just don't know its encoding. Then you may set a very high value for this language, or simply set a default value of 0, and set 1 for this language. Only relevant encodings will be taken into account. This is still limited though as generic encodings are still implemented language-agnostic. UTF-8 for instance would be disadvantaged by this weight system until we make it language-aware.	2022-12-14 00:23:13 +01:00
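The effect of language weights on close candidates can be sketched like this. The candidate tuples, confidence values and the function name `rank_candidates` are made up for illustration; the sketch assumes the weight simply multiplies the confidence, which the message does not spell out.

```python
def rank_candidates(candidates, weights, default_weight=1.0):
    """Scale each (encoding, language, confidence) by its language weight.

    Languages missing from `weights` get `default_weight`. Returns the
    candidates sorted from highest to lowest weighted confidence.
    """
    weighted = [
        (enc, lang, conf * weights.get(lang, default_weight))
        for enc, lang, conf in candidates
    ]
    return sorted(weighted, key=lambda c: c[2], reverse=True)


# Hypothetical tight case: E1/L1 is barely ahead of E2/L2.
candidates = [("E1", "L1", 0.80), ("E2", "L2", 0.78)]

print(rank_candidates(candidates, {})[0][0])            # unweighted winner
print(rank_candidates(candidates, {"L2": 1.1})[0][0])   # L2 preferred
# Known language: default weight 0, weight 1 for that language only.
print(rank_candidates(candidates, {"L1": 1.0}, default_weight=0.0))
```

A candidate with no attached language would fall back to the default weight here, which is one way to see why language-agnostic encodings are disadvantaged when the default is lowered.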
Jehan	f74d602449	src: fix the usage of `uchardet` tool. It was displaying -v for both verbose and version options. The new --verbose short option is actually -V (uppercase).	2022-12-14 00:23:13 +01:00
Jehan	d48ee7abc2	src: `uchardet` tool now shows the language code in verbose mode.	2022-12-14 00:23:13 +01:00
Jehan	c550af99a7	script: update BuildLangModel.py to updated SequenceModel struct. In particular, there is now a language code member.	2022-12-14 00:23:13 +01:00
Jehan	5a949265d5	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A using project could easily get the English language name from the XML/json files provided by the iso-codes project. This project will also allow to easily localize the language name in other languages through gettext (this is what we do in GIMP for instance). I don't add any dependency though and leave it to downstream projects to implement this. I was also wondering if we want to support region information for cases when it would make sense. I especially wondered about it for Chinese encodings as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2022-12-14 00:23:13 +01:00
Jehan	e7bf25ca08	test: fix test script to use the new API and get rid of build warning.	2022-12-14 00:23:13 +01:00
Jehan	7bc1bc4e0a	src: new option --verbose\|-V in the `uchardet` CLI tool. This new option will give the whole candidate list as well as their respective confidence (ordered by higher to lower).	2022-12-14 00:23:13 +01:00
Jehan	8118133e00	src: new API to get all candidates and their confidence. Adding: - uchardet_get_candidates() - uchardet_get_encoding() - uchardet_get_confidence() Also deprecating uchardet_get_charset() to have developers look at the new API instead. I was unsure if this should really get deprecated as it makes the basic case simple, but the new API is just as easy anyway. You can also directly call uchardet_get_encoding() with candidate 0 (same as uchardet_get_charset(), it would then return "" when no candidate was found).	2022-12-14 00:23:13 +01:00
Jehan	15fc8f0a0f	src: now reporting encoding+confidence and keeping a list. Preparing for an updated API which will also allow to look at the confidence value, as well as get the list of possible candidates (i.e. all detected encodings which had a confidence value high enough so that we would even consider them). It is still only internal logics though.	2022-12-14 00:23:13 +01:00
Jehan	2f5c24006e	README, doc: some README and release procedure updates.	2022-12-08 22:34:22 +01:00
Jehan	ae6302a016	Release: version 0.0.8. v0.0.8	2022-12-08 21:52:25 +01:00
Jehan	c218a3ccd6	README: add a section about CMake exported targets. Since it's a new feature, we may as well write about it, even though I would personally not recommend this in favor of more standard and generic pkg-config (which is not dependent on which build system we are using ourselves).	2022-11-30 23:48:16 +01:00
Jehan	6196f86c46	README: update with newly added (lang, charset) couples.	2022-11-30 20:06:52 +01:00
Jehan	388777be51	script, src, test: add IBM865 support for Danish. Newly added IBM865 charset (for Norwegian) can also be used for Danish. By the way, I fixed `script/charsets/ibm865.py` as Danish uses the 'da' ISO 639-1 code, not 'dk' (which is sometimes used for other codes for Denmark, such as ISO 3166 country code and internet TLD, but not for the language itself). For the test, adding some text from the top article of the day on the Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)	2022-11-30 19:57:52 +01:00
Jehan	5aa628272b	script: fix small issues with commits e41e8a4 and 8d15d6b.	2022-11-30 19:24:28 +01:00
Martin T. H. Sandsmark	c11c362b89	Add tests for norwegian	2022-11-30 19:09:21 +01:00
Martin T. H. Sandsmark	099a9a4fd6	Add norwegian support	2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark	e41e8a47e4	improve model building script a bit	2022-11-30 19:09:09 +01:00
Martin T. H. Sandsmark	8d15d6b557	make the logfile usable	2022-11-30 19:09:09 +01:00
Jehan	2a04e57c8f	test: update the Maltese / ISO-8859-3 test file. Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija The old test was fine but had some French words in it, which lowered the confidence for Maltese. Technically it should not be a huge issue in the end, i.e. that if there are enough actual Maltese words, the stats should still weigh in favor of Maltese likeness (which they mostly did anyway), but since I am making some other changes, this was just not enough. In particular I was changing some of the UTF-8 confidence logics and the file ended up detected as UTF-8 (even though it has illegal sequence and cannot be! Cf. #9). So the real long-term solution is to actually fix our UTF-8 detector, which I'll do at some point, but for the time being, let's have definite non-questionable Maltese in there to simplify testing at this early stage of uchardet rewriting.	2022-11-29 14:59:17 +01:00
Lucinda May Phipps	45bd32d102	src/tools/uchardet.cpp: make stuff static	2022-11-29 13:57:31 +00:00
Lucinda May Phipps	ef19faa8c5	Update uchardet-tests.c	2022-11-29 13:57:31 +00:00
Lucinda May Phipps	383bf118c9	don't use feof	2022-11-29 13:57:31 +00:00
myd7349	143b3fe513	README: update libchardet repository link	2022-08-01 19:38:19 +08:00
andiwand	23a664560b	Issue #27 : fix cmake	2021-12-01 13:49:37 +01:00
Jehan	b3b2bd2721	gitignore: I forgot the 2 executables (CLI tool and test binary).	2021-11-09 14:26:21 +01:00
Jehan	48db2b0800	gitignore: add files generated by the build system. Though it is highly encouraged to do out-of-source builds, it is not strictly forbidden to do in-source builds. So we should ignore the files generated by CMake. Only tested with a Linux build, with both make and ninja backends. I added .dll and .dylib versions (for Windows and macOS respectively), guessing these will be the file names on these platforms, unless mistaken (since untested). As discussed in !10, let's add with this commit files generated by the build system, but not any personal environment files (specific to contributors' environment). If I missed any file name which can be generated by the build system in some platforms, configuration, or condition, let's add them as we discover them.	2021-11-09 14:05:31 +01:00
Pedro López-Cabanillas	d7dad549bd	cmake exported targets. The minimum required cmake version is raised to 3.1, because the exported targets started at that version. The build system creates the exported targets: - The executable uchardet::uchardet - The library uchardet::libuchardet - The static library uchardet::libuchardet_static A downstream project using CMake can find and link the library target directly with cmake (without needing pkg-config) this way: ~~~ project(sample LANGUAGES C) find_package ( uchardet ) if (uchardet_FOUND) add_executable( sample sample.c ) target_link_libraries ( sample PRIVATE uchardet::libuchardet ) endif () ~~~ After installing uchardet in a prefix like "$HOME/uchardet/": cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..." Instead of installing, the build directory can be used directly, for instance: cmake -Duchardet_DIR="$HOME/uchardet-0.1.0/build/" ...	2021-11-09 09:52:15 +00:00
Aaron Madlon-Kay	6f38ab95f5	Mention MacPorts in readme	2021-01-27 06:57:58 +00:00
Jehan	c8a3572cca	Issue #17 : update README. Replace the old link to the science paper by one on archive-mozilla website. Remove the original source link as I can't find any archived version of it (even on archive.org, only the folder structure is saved, not actual files themselves, so it's useless). Also add some history, which is probably a nice touch. Add a link to crossroad to help people who'd want to cross-compile uchardet. Finally add the R binding by Artem Klevtsov and QtAV as reported.	2020-04-29 16:20:00 +02:00
Jehan	472a906844	Issue #16 : "i686" uname not properly detected as x86. This is basically a continuation of an older bug from Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101033	2020-04-28 20:43:12 +02:00
myd7349	8681fc060e	build: Add uchardet CLI tool building support for MSVC	2020-04-26 08:16:14 +00:00
myd7349	5bcbd23acf	build: Fix build errors on Windows - Fix string no output variables on UWP On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result: string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE) will be treated as: string(TOLOWER TARGET_ARCHITECTURE) which, as a result, will cause a CMake error: CMake Error at CMakeLists.txt:42 (string): string no output variable specified - Remove unnecessary header inclusions in uchardet.cpp These extra inclusions cause build errors on Windows.	2020-04-26 10:08:45 +08:00
Jehan	a49f8ef6ea	doc: update README.maintainer. There is one more step to transform a git tag into a proper "Gitlab release" with the new platform.	2020-04-23 12:32:49 +02:00
Jehan	59f68dbe57	Release: version 0.0.7 v0.0.7	2020-04-23 11:48:58 +02:00
Jehan	98bc2f31ef	Issue #8 : have BuildLangModel.py add ending newline to generated source.	2020-04-22 22:57:25 +02:00
Jehan	44a50c30ee	Issue #8 : no newline at end of file. Not sure if it is in the C++ standard, or was, but apparently some compilers may complain when files don't end with a newline (though neither GCC nor Clang as our CI and my local builds are fine). So here are all our generated source which didn't have such ending newline (hopefully I forgot none). I just loaded them in my vim editor, and resaved them. This was enough to add an ending newline.	2020-04-22 22:53:25 +02:00
Jehan	6c7f32a751	Issue #10 : Crashing sequence with nsSJISProber. uchardet_handle_data() should not try to process data of nul length. Still this is not technically an error to feed empty data to the engine, and I could imagine it could happen especially when done in some automatic process with random input files (which looks like what was happening in the reporter case). So feeding empty data just returns a success without actually doing any processing, allowing to continue the data feed.	2020-04-22 22:11:51 +02:00

307 Commits