uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-05 17:30:09 +08:00

Author	SHA1	Message	Date
Jehan	d1bc09e4d7	Update authors. I think I deserved being listed in the authors by now. ;-)	2015-12-03 19:44:13 +01:00
Jehan	c4fa728e7a	Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master. Let's shortcut Single Byte charset detection on invalid codepoints. Merging and fixing the contributor's commit conflicts after code redesign: in particular we added an illegal character concept (they were mixed with control characters in current charmaps, yet control characters are NOT to be considered invalid) and constants instead of hardcoded numbers ('ILL' rather than 255).	2015-12-03 19:26:19 +01:00
Jehan	d686fcc1cd	LangModels: add illegal codepoints information on single byte charmaps.	2015-12-03 19:04:07 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	f4f9fc3f28	test: reenable Windows-1251 test for Russian. Commit 4f1c3ff actually fixed it!	2015-12-02 21:53:27 +01:00
Jehan	9dd6b34e93	test: add French UTF-8 test. Text from: https://fr.wikipedia.org/wiki/UTF-8	2015-11-30 20:03:33 +01:00
Jehan	4f1c3ff85e	nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars. If all sequences in a text are positive sequences, the ratio of positive sequences cannot make the difference between 2 very close charsets. A ratio of positive sequences per letter, on the other hand, will break a tie between 2 encodings. If while adding a letter, the number of positive sequences does not increase, the confidence will decrease (corresponding to the fact it was likely not a letter). On the other hand, if the number of positive sequences increases, so will the confidence. For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15. When letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.	2015-11-30 19:52:07 +01:00
Jehan	9cb5764b73	LangModels: update the French language models. Fully built with the script.	2015-11-30 19:20:55 +01:00
Jehan	dc5caa46bc	BuildLangModel: fix hardcoded file names.	2015-11-30 19:18:25 +01:00
Jehan	3e5d37a6b5	BuildLangModel: process pages level per level. I.e. horizontally or "breadth first" rather than vertical tree traversal. This makes sure all the start pages in particular are searched when using the max_page option.	2015-11-30 19:12:04 +01:00
Jehan	04f9309932	tests: update ISO-8859-15 French test file. The previous technical text about charsets themselves was not relevant to identify a language. In particular the special characters that differ between ISO-8859-1 and ISO-8859-15 were used by themselves, out of a char sequence context. Therefore without language understanding, they could just as well have represented the ISO-8859-15 letters or the ISO-8859-1 symbols at the corresponding codepoints. Replacing with text from this Wikipedia page: https://fr.wikipedia.org/wiki/Œuf_(cuisine) This uses some of these same characters (in particular 'œ') but in contextual character sequences, making it relevant for our algorithm.	2015-11-30 00:19:15 +01:00
Jehan	d9d347099e	BuildLangModel: fix some minor comment from a previous spec.	2015-11-30 00:09:23 +01:00
Jehan	192f8de165	BuildLangModel: build models with computed frequent characters count.	2015-11-30 00:04:44 +01:00
Jehan	429448199f	French language model: fix a start page. Because of a bug in the Wikipedia querying Python library.	2015-11-29 23:55:03 +01:00
Jehan	dbb4c1d2ff	nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE... ... with per-language model "frequent character" count.	2015-11-29 23:51:55 +01:00
Jehan	b64831ff89	BuildLangModel: allow a list of start pages... ... and add a page with a word with œ in French to make sure we have such words in our stats.	2015-11-29 15:51:23 +01:00
Jehan	dce79a6631	BuildLangModel: the SequenceModel naming must include the language name.	2015-11-29 15:49:56 +01:00
Jehan	c59465adfc	BuildLangModel: save lang model directly in the right directory.	2015-11-29 13:26:10 +01:00
Jehan	72fbd33dec	Add a .gitignore.	2015-11-29 02:27:42 +01:00
Jehan	290fbd2e2e	BuildLangModel: add the licensing header to generated files.	2015-11-29 02:26:33 +01:00
Jehan	7f290975ba	BuildLangModel: map different cases of the same character together. With the new case_mapping lang property, we can consider upper and lower case versions of the same character as one character. This makes sense in some languages, and would allow some rarer characters (but still in the main alphabet) to enter the frequent character list. For instance 'œ' and 'Œ' in French.	2015-11-29 02:14:48 +01:00
Jehan	00a78faa1d	BuildLangModel: the max_depth should be a script option... ... rather than a language property.	2015-11-29 01:59:28 +01:00
Jehan	274386f424	BuildLangModel: add a --max-page option to limit data size. This is mostly useful for debugging while we don't want to wait forever to test the script.	2015-11-29 01:42:36 +01:00
Jehan	0314f98ece	BuildLangModel.py: some in-progress script to build language models.	2015-11-29 01:30:04 +01:00
Jehan	a8e9de307b	Add UTF-16 test files without BOM... ... and disable the tests for these for now, since uchardet is not yet able to detect UTF-16 without a BOM.	2015-11-28 19:50:18 +01:00
Jehan	92efc0b0b0	Update README: Unicode is "International".	2015-11-28 19:44:13 +01:00
Jehan	573b303fe3	Add an ASCII test file for English... ... with escape characters because even with ESC, a file is ASCII unless proven otherwise.	2015-11-28 17:49:13 +01:00
Jehan	0289c2a232	Differentiate ASCII and detection failure. The lib used to return "" for both properly detected ASCII and detection failure. And the tool would return "ascii/unknown". Make a proper distinction between the 2 cases.	2015-11-28 17:04:52 +01:00
Jehan	4dbc6e7ab3	Update README with French support.	2015-11-28 02:20:57 +01:00
Jehan	50588ba375	Add an ISO-8859-15 test file for French.	2015-11-28 02:18:57 +01:00
Jehan	005fd98086	Add initial support for French with ISO-8859-1 and ISO-8859-15. Mostly generated with a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generating script to the main tree.	2015-11-28 02:14:39 +01:00
Jehan	2106173546	Move all Single-Byte language models to a subdirectory.	2015-11-27 23:11:23 +01:00
Jehan	b67370230b	Update README and manual... ... to indicate several files can be specified on command line.	2015-11-27 18:27:11 +01:00
Jehan	984d8f7b09	Add language information in model names when they were missing. Models are language specific (there could be several models for the same charset but different languages). Let's have a clear naming scheme.	2015-11-27 18:21:13 +01:00
Jehan	c61e65aeb3	s/MACCYRILLIC/MAC-CYRILLIC/ Write encoding names in README same as what uchardet returns.	2015-11-27 18:19:02 +01:00
Jehan	942ac05ff5	Add some Russian test files. Texts from: IBM855: https://ru.wikipedia.org/wiki/CP855 IBM866: https://ru.wikipedia.org/wiki/Альтернативная_кодировка MAC-CYRILLIC: https://ru.wikipedia.org/wiki/MacCyrillic	2015-11-27 18:17:20 +01:00
Jehan	42b91898da	Create 3-letter constants for special charmap characters. Control characters, carriage, symbols and numbers. Also add a constant for illegal characters (not used for now). This will allow easier processing and charmap reading.	2015-11-27 17:41:54 +01:00
Jehan	7fa0fefef8	Add UTF-16 and UTF-32 test files in French, with BOM. Unfortunately uchardet currently seems unable to detect UTF-16/32 text without a BOM.	2015-11-26 02:45:00 +01:00
Ophir LOJKINE	5ef60164fc	Stop detection early on control characters	2015-11-24 22:07:41 +03:00
Jehan	e8dd55995a	Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info... ... and add UTF-32 BOM detection.	2015-11-24 18:50:23 +01:00
Jehan	9a74d08b3c	Fix minor space issues.	2015-11-24 00:15:44 +01:00
Jehan	d082704fec	Add Mageia command and specify Mint compatibility.	2015-11-23 17:46:01 +01:00
Jehan	ff5fd5eff9	Release: version 0.0.3. v0.0.3	2015-11-19 15:18:11 +01:00
Jehan	5dcff7b241	Hide away tests known to fail. Some charsets are simply not supported (ex: fr:iso-8859-1), some are temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly detected as closely related charsets. These were broken (or not efficient) from the start, and there is no need to pollute the `make test` output with these, which may make us miss when actual regressions will occur. So let's hide these away for now until we can improve the situation.	2015-11-18 20:02:58 +01:00
Jehan	4b38e68aa2	CMake tests: separate the lang and charset with colon... ... rather than a hyphen. It makes it easier to read.	2015-11-18 19:42:35 +01:00
Jehan	35153b1e50	Fixes boolean operation precedence warnings... ... and some minor space issues. Some explicit parentheses were needed to make precedence obvious. Warning was: "warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"	2015-11-18 19:38:12 +01:00
Jehan	0d70a36910	Adding some more test files for Russian and Chinese. Taken from: https://zh.wikipedia.org/wiki/EUC https://ru.wikipedia.org/wiki/КОИ-8 And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.	2015-11-18 19:27:38 +01:00
Jehan	eb727d3aca	Add automatic testing against every test file.	2015-11-18 18:18:27 +01:00
Jehan	f303a41735	Add Thai test file for UTF-8. Text from Thai Wikipedia: https://th.wikipedia.org/wiki/ยูนิโคด	2015-11-18 03:26:34 +01:00
Jehan	9d9257072a	s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming.	2015-11-18 03:21:34 +01:00
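
Commit e8dd55995a above mentions BOM-based detection of UTF-16 endianness and UTF-32. As a rough illustration of that general technique (not uchardet's actual C++ code), the Python sketch below matches a byte-order mark at the start of a buffer. The function name `detect_bom` and the returned labels are assumptions made for this example; the BOM byte sequences themselves come from Python's standard `codecs` module.

```python
import codecs

# Illustrative only: labels and function name are hypothetical, not uchardet's API.
# UTF-32 BOMs are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return a label for the BOM that starts `data`, or None if there is none."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None
```

Note that a buffer starting with FF FE 00 00 could in principle also be UTF-16 LE text whose first character is NUL; this sketch simply prefers the UTF-32 reading. Text without a BOM returns `None` here, which matches the limitation noted in commits 7fa0fefef8 and a8e9de307b.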

103 Commits