Commit Graph

  • f53cb8cddd CMake: fix linking with Ninja Ricardo Constantino (:RiCON) 2016-03-16 14:17:47 +00:00
  • 36665da832 CMake: allow installing binary to non-default dir Ricardo Constantino (:RiCON) 2016-03-16 14:17:25 +00:00
  • 198190461e script: move the Wikipedia title syntax cleaning to BuildLangModel.py. Jehan 2016-02-21 16:20:22 +01:00
  • d24bd7d578 script: Wikipedia API's python wrapper does not return garbage text anymore. Jehan 2016-02-21 16:07:10 +01:00
  • 37024460fe script: add a README file dedicated to adding new support. Jehan 2016-02-21 16:06:11 +01:00
  • 42c6b42f65 Add a DOAP file. Jehan 2016-02-21 15:19:50 +01:00
  • d5dba26e04 README: add Danish support for 3 charsets. Jehan 2016-02-19 19:11:56 +01:00
  • 923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). Jehan 2016-02-19 19:07:20 +01:00
  • 1694999bce README: update with VISCII support. Jehan 2016-02-13 03:52:06 +01:00
  • 98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. Jehan 2016-02-13 03:51:18 +01:00
  • 600cf76a76 BuildLangModel: try using iconv for conversion when support missing... Jehan 2016-02-13 03:47:41 +01:00
  • 178c6119b8 LangModels: add Windows-1258 support for Vietnamese. Jehan 2016-02-13 02:32:57 +01:00
  • 27135a8880 BuildLangModel: printing a message when discarding a page. Jehan 2016-02-13 02:27:15 +01:00
  • 0446e24c8d README: uchardet now available on Fedora. Jehan 2016-02-12 17:53:22 +01:00
  • 248d6dbd35 tools: exit with non-zero value on uchardet error. Jehan 2016-01-21 18:16:42 +01:00
  • b6d872bbec app: package name wrong in CMakeLists.txt. Jehan 2015-12-15 21:40:16 +01:00
  • 706023139c tests: add test files for Arabic. Jehan 2015-12-13 18:42:59 +01:00
  • 9c3c37517c LangModels: add Arabic support. Jehan 2015-12-13 18:42:16 +01:00
  • ad2f7212e2 LangModels: retraining Greek models with my training script. Jehan 2015-12-13 18:00:07 +01:00
  • 1b4c62ac21 tests: test files for Spanish. Jehan 2015-12-12 18:55:43 +01:00
  • ffabb65712 LangModels: adding Spanish support. Jehan 2015-12-12 18:54:35 +01:00
  • 055332ac7d BuildLangModel: allow the alphabet list to be written in string format. Jehan 2015-12-12 18:50:29 +01:00
  • 6b2722885a BuildLangModel: forgot to add charset/language files. Jehan 2015-12-12 18:18:08 +01:00
  • 2bade77bf9 tests: update Window-1250 test file for Hungarian. Jehan 2015-12-12 18:07:01 +01:00
  • a251753db8 LangModels: updating Hungarian language models. Jehan 2015-12-12 18:06:17 +01:00
  • 7b4eb9827e BuildLangModel: add an exception handler on charset spec errors. Jehan 2015-12-12 18:00:30 +01:00
  • 4c8316f9cf Nearly-ASCII text with NBSP is still not ASCII. Jehan 2015-12-05 21:04:20 +01:00
  • 886e03a523 Release: version 0.0.5. (tag: v0.0.5) Jehan 2015-12-04 22:45:26 +01:00
  • fe7bf3e994 test: update UTF-16 and UTF-32 tests after label changing. Jehan 2015-12-04 19:46:51 +01:00
  • e5234d6b61 Stating endianness of UTF-16 and UTF-32 was an error when BOM present. Jehan 2015-12-04 19:19:39 +01:00
  • 2856e68aac README: reorganize support list by alphabetic order. Jehan 2015-12-04 03:31:58 +01:00
  • 5691dc59a1 LangModels: rename Cyrillic models to Russian models. Jehan 2015-12-04 03:27:29 +01:00
  • 569509f844 BuildLangModel: forgot to add logs for Thai models generation. Jehan 2015-12-04 03:26:52 +01:00
  • dc03ea002f README: supports are per-language rather than per script system. Jehan 2015-12-04 03:21:23 +01:00
  • fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. Jehan 2015-12-04 03:11:09 +01:00
  • ffcd85f709 script: forgot to commit ISO-8859-9 and Turkish files. Jehan 2015-12-04 02:40:54 +01:00
  • 5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. Jehan 2015-12-04 02:35:09 +01:00
  • 22b9ed2d4f BuildLangModel: add concept of custom_case_mapping… Jehan 2015-12-04 02:29:40 +01:00
  • f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. Jehan 2015-12-04 01:35:56 +01:00
  • a167bd5e42 BuildLangModel: lowercase only when resulting char has a composed form. Jehan 2015-12-04 01:30:21 +01:00
  • b56a3c7b84 README: add German support. Jehan 2015-12-04 00:07:03 +01:00
  • 55b4f23971 Single Byte charsets: high ctrl character ratio lowers confidence. Jehan 2015-12-04 00:00:33 +01:00
  • aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. Jehan 2015-12-03 23:58:41 +01:00
  • 90728e4068 README: update with Windows-1252 support information. Jehan 2015-12-03 21:25:53 +01:00
  • 0270b1e856 Adding French Windows-1252 support. Jehan 2015-12-03 21:22:30 +01:00
  • 5d3fb3dc2f test: add a Windows-1252 French test. Jehan 2015-12-03 21:20:15 +01:00
  • 15afc5c593 test: add a Hungarian Windows-1250 test but skip it for now. Jehan 2015-12-03 21:18:55 +01:00
  • ea34e8b1bd Update doc comment. Jehan 2015-12-03 20:35:26 +01:00
  • 60f641bf37 Update README to mark independence with original Mozilla code. Jehan 2015-12-03 20:32:57 +01:00
  • e4260f4a39 Release: version 0.0.4. (tag: v0.0.4) Jehan 2015-12-03 19:48:58 +01:00
  • ba56d91808 Update uchardet URL in various places. Jehan 2015-12-03 19:48:29 +01:00
  • d1bc09e4d7 Update authors. Jehan 2015-12-03 19:44:13 +01:00
  • c4fa728e7a Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master Jehan 2015-12-03 19:10:33 +01:00
  • d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. Jehan 2015-12-03 19:04:07 +01:00
  • 683255278d Re-enable Hungarian language models. Jehan 2015-12-02 22:24:36 +01:00
  • f4f9fc3f28 test: reenable Windows-1251 test for Russian. Jehan 2015-12-02 21:53:27 +01:00
  • 9dd6b34e93 test: add French UTF-8 test. Jehan 2015-11-30 20:03:33 +01:00
  • 4f1c3ff85e nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars. Jehan 2015-11-30 19:52:07 +01:00
  • 9cb5764b73 LangModels: update the French language models. Jehan 2015-11-30 19:20:55 +01:00
  • dc5caa46bc BuildLangModel: fix hardcoded file names. Jehan 2015-11-30 19:18:25 +01:00
  • 3e5d37a6b5 BuildLangModel: process pages level per level. Jehan 2015-11-30 19:12:04 +01:00
  • 04f9309932 tests: update ISO-8859-15 French test file. Jehan 2015-11-30 00:19:15 +01:00
  • d9d347099e BuildLangModel: fix some minor comment from a previous spec. Jehan 2015-11-30 00:08:22 +01:00
  • 192f8de165 BuildLangModel: build models with computed frequent characters count. Jehan 2015-11-30 00:04:44 +01:00
  • 429448199f French language model: fix a start page. Jehan 2015-11-29 23:55:03 +01:00
  • dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE... Jehan 2015-11-29 23:51:55 +01:00
  • b64831ff89 BuildLangModel: allow a list of start pages... Jehan 2015-11-29 15:51:23 +01:00
  • dce79a6631 BuildLangModel: the SequenceModel naming must include the language name. Jehan 2015-11-29 15:49:56 +01:00
  • c59465adfc BuildLangModel: save lang model directly in the right directory. Jehan 2015-11-29 13:26:10 +01:00
  • 72fbd33dec Add a .gitignore. Jehan 2015-11-29 02:27:42 +01:00
  • 290fbd2e2e BuildLangModel: add the licensing header to generated files. Jehan 2015-11-29 02:26:33 +01:00
  • 7f290975ba BuildLangModel: map different cases of the same character together. Jehan 2015-11-29 02:14:48 +01:00
  • 00a78faa1d BuildLangModel: the max_depth should be a script option... Jehan 2015-11-29 01:59:28 +01:00
  • 274386f424 BuildLangModel: add a --max-page option to limit data size. Jehan 2015-11-29 01:42:36 +01:00
  • 0314f98ece BuildLangModel.py: some in-progress script to build language models. Jehan 2015-11-29 01:30:04 +01:00
  • a8e9de307b Add UTF-16 test files without BOM... Jehan 2015-11-28 19:50:18 +01:00
  • 92efc0b0b0 Update README: Unicode is "International". Jehan 2015-11-28 19:44:13 +01:00
  • 573b303fe3 Add an ASCII test file for English... Jehan 2015-11-28 17:49:13 +01:00
  • 0289c2a232 Differentiate ASCII and detection failure. Jehan 2015-11-28 16:44:09 +01:00
  • 4dbc6e7ab3 Update README with French support. Jehan 2015-11-28 02:20:57 +01:00
  • 50588ba375 Add a ISO-8859-15 test file for French. Jehan 2015-11-28 02:18:57 +01:00
  • 005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15. Jehan 2015-11-28 02:14:39 +01:00
  • 2106173546 Move all Single-Byte language models to a subdirectory. Jehan 2015-11-27 23:11:23 +01:00
  • b67370230b Update README and manual... Jehan 2015-11-27 18:27:11 +01:00
  • 984d8f7b09 Add language information in model names when they were missing. Jehan 2015-11-27 18:21:13 +01:00
  • c61e65aeb3 s/MACCYRILLIC/MAC-CYRILLIC/ Jehan 2015-11-27 18:19:02 +01:00
  • 942ac05ff5 Add some Russian test files. Jehan 2015-11-27 18:07:01 +01:00
  • 42b91898da Create 3-letter constants for special charmap characters. Jehan 2015-11-27 17:41:54 +01:00
  • 7fa0fefef8 Add UTF-16 and UTF-32 test files in French, with BOM. Jehan 2015-11-24 19:22:09 +01:00
  • 5ef60164fc Stop detection early on control characters Ophir LOJKINE 2015-11-24 22:07:41 +03:00
  • e8dd55995a Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info... Jehan 2015-11-24 18:50:23 +01:00
  • 9a74d08b3c Fix minor space issues. Jehan 2015-11-24 00:15:44 +01:00
  • d082704fec Add Mageia command and specify Mint compatibility. Jehan 2015-11-23 17:45:06 +01:00
  • ff5fd5eff9 Release: version 0.0.3. (tag: v0.0.3) Jehan 2015-11-19 15:18:11 +01:00
  • 5dcff7b241 Hide away tests known to fail. Jehan 2015-11-18 20:01:12 +01:00
  • 4b38e68aa2 CMake tests: separate the lang and charset with colon... Jehan 2015-11-18 19:42:35 +01:00
  • 35153b1e50 Fixes boolean operation precedence warnings... Jehan 2015-11-18 19:38:12 +01:00
  • 0d70a36910 Adding some more test files for Russian and Chinese. Jehan 2015-11-18 19:10:52 +01:00
  • eb727d3aca Add automatic testing against every test file. Jehan 2015-11-18 18:17:01 +01:00
  • f303a41735 Add Thai test file for UTF-8. Jehan 2015-11-18 03:26:34 +01:00