Newly added IBM865 charset (for Norwegian) can also be used for Danish.
By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the 'da'
ISO 639-1 code, not 'dk' ('dk' appears in other codes for Denmark, such as
the ISO 3166 country code and the internet TLD, but is not the code for
the language itself).
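As a side note, this is easy to verify from Python itself, which exposes
IBM865 under the codec name 'cp865' (the sample word here is just an
illustration, not the test file's content):

    sample = "blåbærgrød"           # Danish word using å, æ and ø
    data = sample.encode("cp865")   # 'cp865' is Python's name for IBM865
    assert data.decode("cp865") == sample
    language = "da"                 # ISO 639-1 for Danish; 'dk' is only the country code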
For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end: if there are enough
actual Maltese words, the statistics should still weigh in favor of
Maltese (which they mostly did anyway). But since I am making some other
changes, this was just not enough. In particular, I was changing some of
the UTF-8 confidence logic, and the file ended up detected as UTF-8 (even
though it contains illegal sequences and therefore cannot be UTF-8!
Cf. #9).
So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have
unquestionably Maltese text in there to simplify testing at this early
stage of the uchardet rewrite.
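For the record, the invariant the fixed detector will have to enforce can
be sketched in a few lines of Python (a sketch of the principle only, not
the actual C++ implementation):

    def could_be_utf8(data: bytes) -> bool:
        # Any illegal sequence must disqualify UTF-8 entirely,
        # no matter how high the statistical confidence was.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    assert not could_be_utf8(b"\xc3")                # truncated sequence
    assert not could_be_utf8(b"\xa0abc")             # stray continuation byte
    assert could_be_utf8("Għawdex".encode("utf-8"))  # valid Maltese text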
There is no need to do it per target; I use the global CMAKE_CXX_STANDARD
variable instead. Also, with CMAKE_CXX_STANDARD_REQUIRED, I make this a
strong requirement. The documentation indeed states that CXX_STANDARD "is
treated as optional and may “decay” to a previous standard if the
requested is not available".
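For reference, this amounts to the following two lines of CMake (shown as
they would typically be written; exact placement in our CMakeLists.txt may
differ):

    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)  # fail instead of decaying to an older standard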
This means that uchardet will likely not be buildable with a compiler
lacking C++11 support. But I assume this is not a common situation, and we
probably should not care about outdated compilers. I remain open to
suggestions and disagreement on the topic, obviously.
As discussed in bug 101032, it seems like the most common usage
nowadays. Let's make a specific choice to avoid different behavior on
different builds later on.
Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(Jutumärgid), which sit at codepoints only assigned in ISO-8859-13,
telling the two encodings apart.
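A quick check of that claim, assuming Python's codec names iso8859_13 and
cp1257 for these two charsets:

    s = "„Jutumärgid”"
    assert 0xA5 in s.encode("iso8859_13")  # „ sits at 0xA5 in ISO-8859-13
    assert 0x84 in s.encode("cp1257")      # but at 0x84 in Windows-1257
    try:
        b"\xa5".decode("cp1257")           # 0xA5 is unassigned in Windows-1257,
    except UnicodeDecodeError:
        pass                               # so those bytes rule Windows-1257 out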
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests are added,
since the other encodings end up identical to ISO-8859-1 for most common
texts (i.e. the glyphs used in Italian sit at the same codepoints in these
other encodings); a quick check follows the test link below.
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
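Here is that check, assuming only Python's codec names for these charsets:

    s = "perché la città è già lì"
    encodings = ["iso8859_1", "iso8859_3", "iso8859_9", "iso8859_15", "cp1252"]
    # All five produce byte-identical output for common Italian text.
    assert len({s.encode(e) for e in encodings}) == 1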
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain the Finnish letters.
Nevertheless, most texts in these encodings end up byte-identical (same
codepoints for the Finnish glyphs), so I keep only tests for ISO-8859-1
and UTF-8. Models for the other encodings may still be useful when
processing texts with some symbols, etc.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope), since the resources I found indicate both languages
historically used the same encodings.
It is also worth noting that the test examples' encodings were already
properly detected through the Czech models, so the two languages are
definitely very close, even statistically. Nevertheless, the right models
work better and get better scores. This will take on its full meaning once
uchardet is also used as a language detector (in some not-too-distant
future, hopefully!); a loose sketch of that idea follows the test link
below.
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
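To make the language-detector idea concrete, here is that sketch, with an
entirely hypothetical API (this is not current uchardet code; names and
scores are made up): once models are keyed by (language, charset) couples,
the best-scoring couple answers both questions at once.

    def detect(data: bytes, models: dict) -> tuple:
        # models maps (language, charset) couples to scoring functions.
        return max(models, key=lambda couple: models[couple](data))

    # Czech and Slovak models may both rate a Slovak text highly,
    # but the Slovak model should win by a small margin.
    models = {
        ("cs", "ISO-8859-2"): lambda data: 0.91,  # made-up score
        ("sk", "ISO-8859-2"): lambda data: 0.95,  # made-up score
    }
    assert detect(b"Jupiter je piata planeta", models) == ("sk", "ISO-8859-2")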
Just realized that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway the models are now updated and can recognize texts using
these encodings for these languages.
Added some test files as well, which work great.
I actually also added the couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless, there are no differences in the main
characters relevant to Portuguese, so these can hardly be told apart and
detection will usually return ISO-8859-1 only.
I was planning on adding VISCII support as well, but Python's encode()
method apparently has no support for it, so I cannot generate the proper
statistics data with the current version of the script.
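In a nutshell, this is what the generating script runs into (CPython
simply ships no VISCII codec):

    try:
        "Tiếng Việt".encode("viscii")
    except LookupError:
        # "unknown encoding: viscii": one would have to register a custom
        # codec via the codecs module before any statistics can be computed.
        pass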
I disable only ISO-8859-15, which is identical to ISO-8859-1 for all
Spanish letters. Unfortunately, the illegal codepoints are the same too.
The difference would likely have to be made on symbols (like the euro
symbol), but our current algorithm does nothing about this for charset
comparison. The check below shows how little actually differs.
Text from https://es.wikipedia.org/wiki/España
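Computed from Python's own tables, only eight codepoints separate the two
charsets, none of them letters used in Spanish:

    diff = [b for b in range(0xA0, 0x100)
            if bytes([b]).decode("iso8859_1") != bytes([b]).decode("iso8859_15")]
    # [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]: the euro sign and
    # Š/š/Ž/ž/Œ/œ/Ÿ replacing old Latin-1 symbols at those positions.
    print([hex(b) for b in diff])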
ISO-8859-2 and Windows-1250 are identical for all letters of the
Hungarian alphabet. So for most texts, it is not an error to return
either charset.
What could make the difference is, for instance, that Windows-1250 has
symbols where ISO-8859-2 has control characters: quotes, dashes, the
euro symbol…
Since control characters now have a negative impact on confidence, texts
with such symbols will tend towards a Windows-1250 decision.
The new test file contains such quote symbols.
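To illustrate, the same bytes decode to printable symbols in Windows-1250
but to C1 control characters in ISO-8859-2 (using Python's codec names
cp1250 and iso8859_2):

    for b in (0x80, 0x84, 0x96):                 # €, „ and – in Windows-1250
        print(hex(b),
              repr(bytes([b]).decode("cp1250")),     # printable symbol
              repr(bytes([b]).decode("iso8859_2")))  # C1 control character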
ISO-8859-11 is identical to TIS-620 except for the added non-breaking
space character.
So our detection will basically always return TIS-620, except in the
exceptional case of a text containing a non-breaking space.
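In other words, the whole disambiguation boils down to this hypothetical
helper (a sketch, not uchardet code):

    def thai_charset(data: bytes) -> str:
        # NBSP at byte 0xA0 is the only codepoint ISO-8859-11 adds over TIS-620.
        return "ISO-8859-11" if 0xA0 in data else "TIS-620"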
The previous technical text, about charsets themselves, was not relevant
for identifying a language. In particular, the special characters that
differ between ISO-8859-1 and ISO-8859-15 were used in isolation, outside
of any character-sequence context. Therefore, without language
understanding, they could just as well have represented the ISO-8859-15
letters or the ISO-8859-1 symbols at the corresponding codepoints.
Replacing with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This uses some of these same characters (in particular 'œ') but in
contextual character sequences, making it relevant for our algorithm.
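A rough illustration of what contextual character sequences buy us: the
detector's models score adjacent-character pairs, so an isolated 'œ'
carries no sequence information, while 'œ' inside a word does. Pair
extraction is sketched in Python here; the real frequency tables live in
the C++ models.

    def pairs(text: str):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    print(pairs("œ"))    # []            -> nothing for the model to score
    print(pairs("œuf"))  # ['œu', 'uf']  -> pairs a French model rates as frequent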
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
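For the curious, a stripped-down idea of what such a generating script
does (a hypothetical sketch, since the actual script is not in the tree
yet): count character-pair frequencies over Wikipedia text, keep the most
frequent pairs as the model's table, and record which share of all pairs
they cover, which is what the typical positive ratio expresses.

    from collections import Counter

    def build_model(corpus: str, keep: int = 512):
        counts = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
        table = dict(counts.most_common(keep))
        ratio = sum(table.values()) / max(sum(counts.values()), 1)
        return table, ratio  # frequency table + typical positive ratio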