Commit Graph

  • 2be37a8746 Merge branch 'devel' into 'master' Pedro López-Cabanillas 2025-08-10 19:23:32 +02:00
  • ffc9c33a0f Revert "Create linux-build.yml" Pedro López-Cabanillas 2025-08-10 19:20:46 +02:00
  • 3c1db7c8b1 Merge branch uchardet:master into devel Pedro López-Cabanillas 2025-08-10 19:17:48 +02:00
  • 06029ec334 src: allow setting a default language in the CLI tool. (master) Jehan 2025-08-08 11:40:10 +02:00
  • b516d6f9ca gitlab-ci: python3-distutils package doesn't exist anymore in Debian testing. (wip/Jehan/no-distutils) Jehan 2025-06-08 02:08:58 +02:00
  • 9699dfce07 Issue #40: Close file when it's no longer needed Marcus Nilsson 2024-08-21 16:45:57 +02:00
  • 0d86c111a7 fix for gb18030 encoding test Pedro López-Cabanillas 2024-03-31 16:21:06 +02:00
  • 87ed76971b Create linux-build.yml Pedro López-Cabanillas 2024-03-31 13:06:44 +02:00
  • dff8906402 fix: FTBFS under MSVC Gary Wang 2024-09-22 16:16:22 +08:00
  • 6e163c978a CMake: Raise required version to 3.5 Heiko Becker 2025-02-20 23:18:54 +01:00
  • edae8e81cf gitlab-ci: CI is now forbidden on MR run by passing-by contributors. Jehan 2023-11-15 17:05:34 +01:00
  • b95252ff0c Add notepad++ to readme Jaroslav Lobačevski 2023-11-15 13:39:38 +00:00
  • ab1d2f1120 src: handle long sequences of characters. Jehan 2023-07-17 20:09:10 +02:00
  • 9910941387 Issue #33: crafted sequence of bytes triggers memory write past the bounds of… Jehan 2023-07-17 18:46:35 +02:00
  • 8fe0b2e080 src: fix mismatched new [] / delete. Jehan 2023-07-17 16:43:45 +02:00
  • bc93da89d9 Issue #32: Global buffer read overflow in GetOrderFromCodePoint. Jehan 2023-07-17 16:39:52 +02:00
  • bd983ca108 CMake: enable ASAN in Debug builds. Jehan 2023-07-17 16:29:08 +02:00
  • bdd71d88f8 script: improve a bit create-table.py and regenerate the Georgian charsets. Jehan 2022-12-20 14:38:51 +01:00
  • 7875272a8c script, src, test: new Georgian support. Jehan 2022-12-20 14:23:24 +01:00
  • c843d23a17 script: new create-table script. Jehan 2022-12-20 12:03:19 +01:00
  • 419a971e6a script: update the README. Jehan 2022-12-20 01:56:24 +01:00
  • d40e5868d5 script, src, test: adding Catalan support. Jehan 2022-12-20 01:46:15 +01:00
  • cec8817d79 src: new Big5 detection implementation. Jehan 2022-12-18 23:53:14 +01:00
  • 0fe51d3851 Issue #21: Greek CP737 support. Jehan 2022-12-18 22:28:54 +01:00
  • a82139b3bd script: fix a notice message. Jehan 2022-12-18 22:24:55 +01:00
  • d4ef245fdc script: add a requirements.txt for our generation script. Jehan 2022-12-18 17:27:38 +01:00
  • db836fad63 script, src: generate more code for language and sequence model listing. Jehan 2022-12-18 17:13:17 +01:00
  • d6cab28fb4 README: missing UTF-8 support listed on several languages. Jehan 2022-12-17 23:00:26 +01:00
  • abd123e07d script, src, test: add Serbian support. Jehan 2022-12-17 22:46:13 +01:00
  • d00d4d52b7 src, script: add Macedonian support. Jehan 2022-12-17 22:25:32 +01:00
  • 41d309e8a2 script, src: regenerate Russian models and add UTF-8/Russian support. Jehan 2022-12-17 21:32:24 +01:00
  • 60dcec8a82 script, src, test: add Ukrainian support. Jehan 2022-12-17 21:24:59 +01:00
  • 0fffc109b5 script, src, test: adding Belarusian support. Jehan 2022-12-17 19:13:03 +01:00
  • ffb94e4a9d script, src, test: Bulgarian language models added. Jehan 2022-12-17 18:30:55 +01:00
  • 5e25e93da7 script: add an error handling for when iconv fail to convert from a codepoint. Jehan 2022-12-17 18:00:22 +01:00
  • 6d31689632 test: adding 2 tests for Hebrew/IBM862 recognition. (wip/Jehan/improved-API) Jehan 2022-12-16 23:28:28 +01:00
  • 0974920bdd Issue #22: Hebrew CP862 support. Jehan 2022-12-16 23:17:47 +01:00
  • 127d7faf47 test: add ability to have several tests per charsets. Jehan 2022-12-16 23:10:34 +01:00
  • 3a6806ab19 test: no:utf-8 is actually working now, after the last model script fix… Jehan 2022-12-15 15:11:17 +01:00
  • e6e51d9fe8 src: all language models now rebuilt after the fix. Jehan 2022-12-15 14:31:31 +01:00
  • 362086bf56 script: fix BuildLangModel.py. Jehan 2022-12-15 14:31:10 +01:00
  • 598fe90c91 test: finally add English/UTF-8 test file. Jehan 2022-12-14 21:45:29 +01:00
  • 6bb1b3e101 scripts: all language models rebuilt with the new ratio data. Jehan 2022-12-14 20:16:44 +01:00
  • e311b64cd9 script: model-building script updated to produce the 2 new ratios… Jehan 2022-12-14 20:15:01 +01:00
  • 401eb55dfc src: improve algorithm for confidence computation. Jehan 2022-12-14 20:02:59 +01:00
  • 4f35cd4416 src: when checking for candidates, make sure we haven't any unprocessed… Jehan 2022-12-14 08:39:49 +01:00
  • 7f386d922e script, src: rebuild the English model. Jehan 2022-12-14 00:32:52 +01:00
  • fb433a57b5 src: add a --language|-l option to the uchardet CLI tool. Jehan 2022-12-14 00:15:34 +01:00
  • 908f9b8ba7 src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. Jehan 2022-12-13 23:45:22 +01:00
  • a916fb1c56 test: temporarily disable the Norwegian/UTF-8 test. Jehan 2022-12-13 23:33:50 +01:00
  • baeefc0958 src: process pending language data when we are going to pass buffer size. Jehan 2022-12-13 23:28:40 +01:00
  • b5b75b81ce script, src: rebuild the Danish model. Jehan 2022-11-30 20:58:37 +01:00
  • 0be80a21db script, src: update Norwegian model with the new language features. Jehan 2022-11-30 20:33:11 +01:00
  • 784f614c84 script: further fixing BuildLangModel.py. Jehan 2022-11-30 20:17:25 +01:00
  • 6365cad4fd script: improve a bit the management of use_ascii option. Jehan 2021-11-09 22:18:11 +01:00
  • 81b83fffa9 script: work around recent issue of python wikipedia module. Jehan 2021-11-09 22:06:47 +01:00
  • a3ff09bece test: improve test error output even more. Jehan 2021-11-09 15:05:38 +01:00
  • c9446e540d test: add stderr logging when a test fails. Jehan 2021-11-09 14:32:03 +01:00
  • bfa4b10d4d script, src: add English language model. Jehan 2021-05-23 19:33:36 +02:00
  • bed459c6e7 src: drop less of UTF-8 confidence even with few non-multibyte chars. Jehan 2021-05-23 17:04:37 +02:00
  • bffb7819d2 test: fix test binary build for Windows. Jehan 2021-03-22 21:06:20 +01:00
  • 5cf3c648fb src: reset shortcut charset/language on Reset(). Jehan 2021-03-22 18:29:34 +01:00
  • d6c5c26150 src: do not test with nsLatin1Prober anymore. Jehan 2021-03-22 18:15:34 +01:00
  • 6436e1dd47 src: improve confidence computation (generic and single-byte charset). Jehan 2021-03-22 18:03:02 +01:00
  • 8e2cf7b81b script: generate more complete frequent characters when range is set. Jehan 2021-03-22 17:44:06 +01:00
  • 314f062c70 script, src: regenerate the Thai model. Jehan 2021-03-22 17:06:27 +01:00
  • 41fec68674 src, script: fix the order of characters for Vietnamese. Jehan 2021-03-21 16:02:03 +01:00
  • 338a51564a src, script: add concept of alphabet_mapping in language models. Jehan 2021-03-21 15:54:24 +01:00
  • ba7d72e3b0 script: regenerate Slovak and Slovene with better alphabet support. Jehan 2021-03-21 13:30:41 +01:00
  • adb158b058 script: fix a stupid bug making same ratio for all frequent characters. Jehan 2021-03-21 12:30:29 +01:00
  • 19737886fe script, src: regenerate the Vietnamese model. Jehan 2021-03-21 01:12:56 +01:00
  • 9d29c3e26f src: fix negative confidence wrapping around because of unsigned int. Jehan 2021-03-20 23:02:10 +01:00
  • b7acffc806 script, src: remove generated statistics data for Korean. Jehan 2021-03-20 22:59:52 +01:00
  • b725c0b2ff src: new nsCJKDetector specifically Chinese/Japanese/Korean recognition. Jehan 2021-03-20 22:12:45 +01:00
  • c782177a8d README: fix a duplicate. Jehan 2021-03-19 23:45:30 +01:00
  • 3ca49e2bc1 Update README. Jehan 2021-03-19 23:24:34 +01:00
  • 8113f604de src: consider any combination with a non-frequent character as sequence. Jehan 2021-03-19 22:37:27 +01:00
  • a1b186fa8b src: add Hindi/UTF-8 support. Jehan 2021-03-19 22:34:55 +01:00
  • 9736950227 src: improve confidence computation. Jehan 2021-03-19 21:46:53 +01:00
  • a98cdcd88f script: fix a bit BuildLangModel.py when use_ascii is True. Jehan 2021-03-19 18:38:30 +01:00
  • 629bc879f3 script, src: add generic Korean model. Jehan 2021-03-18 17:51:22 +01:00
  • 0d152ff430 src, test: fix the new Johab prober and add a test. Jehan 2021-03-18 00:23:13 +01:00
  • 3996b9d648 src: build new charset prober for Johab Korean. Jehan 2021-03-14 12:59:25 +01:00
  • d72a5c88ce add charset prober for Johab Korean LSY 2019-03-14 06:34:42 +09:00
  • ded948ce15 script, src: generate the Hebrew models. Jehan 2021-03-17 23:22:50 +01:00
  • cf0ffb0c55 test: 4 new tests for UTF-8. Jehan 2021-03-17 22:27:24 +01:00
  • a7c5a167a9 src: drop the SURE_YES confidence for character distribution probers. Jehan 2021-03-17 21:32:49 +01:00
  • b00c85a6a6 src: do not shortcut UTF-8 detection too early. Jehan 2021-03-17 21:26:31 +01:00
  • 2a16ab2310 src: nsEscCharsetProber also returns the correct language. Jehan 2021-03-17 17:15:56 +01:00
  • 6138d9e0f0 src: make nsMBCSGroupProber report all valid candidates. Jehan 2021-03-17 16:34:26 +01:00
  • 2127f4fc0d src: allow for nsCharSetProber to return several candidates. Jehan 2021-03-17 13:23:33 +01:00
  • ea32980273 src: nsMBCSGroupProber confidence weighed by language confidence. Jehan 2021-03-17 13:09:10 +01:00
  • 25d2890676 src: tweak again the language detection confidence. Jehan 2021-03-17 12:51:25 +01:00
  • 1b5e68be00 test: update unit test to check detected languages. Jehan 2021-03-17 12:39:54 +01:00
  • 82c1d2b25e src: reset language detectors when resetting a nsMBCSGroupProber. Jehan 2021-03-17 11:03:30 +01:00
  • eb8308d50a src, script: regenerate all existing language models. Jehan 2021-03-17 02:07:17 +01:00
  • 5257fc1abf Using the generic language detector in UTF-8 detection. Jehan 2021-03-15 12:01:35 +01:00
  • dac7cbd30f New generic language detector class. Jehan 2021-03-16 12:05:56 +01:00
  • b70b1ebf88 Rebuild a bunch of language models. Jehan 2021-03-15 10:20:14 +01:00
  • a0bfba3db3 src: add a --weight option to the CLI tool. Jehan 2020-04-27 18:14:34 +02:00