uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-12 06:30:05 +08:00

Author	SHA1	Message	Date
Ilya Tumaykin	29f18210b1	cmake: hardcode less	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	b44be77be6	cmake: uniform indent everywhere. Indent with tabs, remove leading/trailing blank lines and spaces.	2016-03-21 01:07:41 +03:00
Jehan	923d264470	LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). The test for ISO-8859-1 is disabled for now since, for the characters used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not big enough. Therefore the first to be declared "wins". Let's see about improving this later. Test contents from: https://da.wikipedia.org/wiki/Eurosymbol https://da.wikipedia.org/wiki/Dansk_%28sprog%29	2016-02-19 19:10:41 +01:00
Jehan	ad2f7212e2	LangModels: retraining Greek models with my training script. This fixes our Greek/Windows-1253 test.	2015-12-13 18:02:11 +01:00
Jehan	1b4c62ac21	tests: test files for Spanish. I disable only ISO-8859-15, which is similar to ISO-8859-1 for all Spanish letters. Unfortunately illegal codepoints are similar too. The distinction should likely be made on symbols (like the euro symbol), but our current algorithm does nothing about this for charset comparison. Text from https://es.wikipedia.org/wiki/España	2015-12-12 18:55:43 +01:00
Jehan	2bade77bf9	tests: update Windows-1250 test file for Hungarian. ISO-8859-2 and Windows-1250 are identical for all letters in the Hungarian alphabet. So for most texts, it is not an error to return one charset or the other. What could make the difference is, for instance, that Windows-1250 has some symbols where ISO-8859-2 has control characters, like quotes, dashes, the euro symbol… Since control characters now have a negative impact on confidence, texts with such symbols would tend towards a Windows-1250 decision. The new test file has such quote symbols.	2015-12-12 18:12:08 +01:00
Jehan	15afc5c593	test: add a Hungarian Windows-1250 test but skip it for now. Text from: https://hu.wikipedia.org/wiki/Magyar_nyelv	2015-12-03 21:18:55 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	f4f9fc3f28	test: reenable Windows-1251 test for Russian. Commit 4f1c3ff actually fixed it!	2015-12-02 21:53:27 +01:00
Jehan	a8e9de307b	Add UTF-16 test files without BOM... and disable the tests for these for now, since uchardet is not yet able to detect UTF-16 without a BOM.	2015-11-28 19:50:18 +01:00
Jehan	005fd98086	Add initial support for French with ISO-8859-1 and ISO-8859-15. Mostly generated with a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generating script to the main tree.	2015-11-28 02:14:39 +01:00
Jehan	5dcff7b241	Hide away tests known to fail. Some charsets are simply not supported (ex: fr:iso-8859-1), some are temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly detected as closely related charsets. These were broken (or not efficient) from the start, and there is no need to pollute the `make test` output with these, which may make us miss actual regressions when they occur. So let's hide these away for now until we can improve the situation.	2015-11-18 20:02:58 +01:00
Jehan	4b38e68aa2	CMake tests: separate the lang and charset with a colon... rather than a hyphen. It makes it easier to read.	2015-11-18 19:42:35 +01:00
Jehan	eb727d3aca	Add automatic testing against every test file.	2015-11-18 18:18:27 +01:00

14 Commits
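
Note on commit 2bade77bf9: the message says that ISO-8859-2 and Windows-1250 agree on every Hungarian letter and differ only where Windows-1250 places symbols (quotes, dashes, the euro sign) in the range that ISO-8859-2 reserves for control characters. The sketch below is not uchardet code. It only uses Python's built-in codecs to illustrate that claim, and the letter list, function names and sample sentence are assumptions made up for the illustration.

```python
import unicodedata

# Accented letters of the Hungarian alphabet (assumed list for this sketch).
HUNGARIAN_LETTERS = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"


def same_bytes(text):
    """True if text encodes to identical bytes in both charsets."""
    return text.encode("iso8859_2") == text.encode("cp1250")


def controls_when_read_as_latin2(text):
    """Encode text as Windows-1250, decode the bytes as ISO-8859-2,
    and return the characters that come out as control characters."""
    decoded = text.encode("cp1250").decode("iso8859_2")
    return [ch for ch in decoded if unicodedata.category(ch) == "Cc"]


# Hypothetical sample: typographic quotes, an en dash and a euro sign.
sample = "„Magyar nyelv” – 5 €"

print(same_bytes(HUNGARIAN_LETTERS))              # letters alone do not discriminate
print(len(controls_when_read_as_latin2(sample)))  # the symbols do
```

Under these assumptions, a detector that penalizes control characters has a reason to prefer Windows-1250 for the sample sentence, while a text made of letters alone gives it nothing to choose between the two charsets.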