For UTF-8, ISO-8859-1 and WINDOWS-1252 support.
The test for UTF-8 and ISO-8859-1 is taken from the 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. Since ISO-8859-1 and WINDOWS-1252 are very similar
regarding most letters (in particular the ones used in Catalan), I
differentiated the tests with a text containing the '€' symbol, which
sits on an unused spot in ISO-8859-1.
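For illustration, here is a minimal sketch (not uchardet's actual
prober logic; the helper name is made up) of why that byte range is so
discriminating:

    /* Bytes 0x80-0x9F are unassigned C1 controls in ISO-8859-1 but
     * mostly printable characters in Windows-1252 (0x80 being the euro
     * sign), so seeing them in otherwise Latin-1-looking text hints at
     * Windows-1252. */
    #include <string>

    bool hints_at_windows_1252(const std::string &text)
    {
        for (unsigned char c : text)
            if (c >= 0x80 && c <= 0x9F)
                return true;
        return false;
    }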
It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, since I haven't really touched
that code since the original Mozilla implementation.
It might be time to start reviewing these parts of the code.
The test file content comes from the 'Μαρμότα' page on Wikipedia in
Greek (though, since 2 letters are missing from this encoding despite
its popularity for Greek, I had to be careful to choose pieces of text
without those letters).
This fixes the broken Russian test in Windows-1251, which once again
gets a much better score for Russian. This also adds UTF-8 support.
Same as with Bulgarian, I wonder why I had not regenerated this earlier.
The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.
Note that this now breaks the zh:gb18030 test (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a
bit closer at our dedicated GB18030 prober.
UTF-8 and Windows-1251 support for now.
This actually breaks the ru:windows-1251 test but, as with Bulgarian, I
never generated the Russian models with my scripts, so the models we
currently use are quite outdated. Results will obviously be a lot better
once we have new Russian models.
The test file content comes from the 'Бабак' page on Wikipedia in
Ukrainian.
Not sure why we had Bulgarian support but never updated it (i.e.
apparently never with the model generation script), especially with
generic language models, which allow us to have UTF-8/Bulgarian support.
Maybe I tested it some time ago and it was getting bad results? Anyway,
with all the recent updates to the confidence computation, I now get
very good detection scores.
So this adds support for UTF-8/Bulgarian and rebuilds the other models
too. It also adds a test for ISO-8859-5/Bulgarian (we already had
support, but no test files).
The 2 new test files are text from the 'Мармоти' page on Wikipedia in
Bulgarian.
This is the same text, taken from this Wikipedia page, which was the
featured article of the day on Wikipedia in Hebrew:
https://he.wikipedia.org/wiki/שתי מסכתות על ממשל מדיני
I put it in 2 variants, since IBM862 can be used in logical and visual
variants. The visual variant simply reverses the order of the letters
within each line (while the lines themselves stay in proper order), so
that's what I did (see the sketch below).
Note that the English title quoted in the text should probably not have
been reversed, but it doesn't matter too much: those characters are
outside the Hebrew alphabet and would trigger a bad sequence score
whatever their order. So I didn't bother fixing them.
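For reference, a sketch of the per-line reversal described above (just
illustrative, not necessarily the exact tool that was used):

    // Reverse the characters of each line while keeping the line order.
    // Byte-wise reversal is only correct for single-byte encodings such
    // as IBM862; it would corrupt multi-byte text like UTF-8.
    #include <algorithm>
    #include <iostream>
    #include <string>

    int main()
    {
        std::string line;
        while (std::getline(std::cin, line))
        {
            std::reverse(line.begin(), line.end());
            std::cout << line << '\n';
        }
    }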
While the expected charset name is still the first part of the test
file name (up to the first dot character), the test name is now
everything but the last part (up to the last dot character). This allows
having several test files for a single charset.
In particular, I want at least 2 test files for Hebrew, since it has a
visual and a logical variant. So I can call these "ibm862.visual.txt"
and "ibm862.logical.txt", which both expect IBM862 as the resulting
charset, but whose test names will be "he:ibm862.visual" and
"he:ibm862.logical" respectively. Without this change, the test names
would collide and CMake would refuse them.
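Roughly, the naming logic looks like this (a sketch only; the variable
and runner names are illustrative, not necessarily what the actual
CMakeLists.txt uses):

    # For a test file "ibm862.visual.txt" in language directory "he":
    get_filename_component(_file_name "${_test_file}" NAME)      # ibm862.visual.txt
    string(REGEX REPLACE "\\..*$" "" _charset "${_file_name}")   # ibm862 (up to first dot)
    string(REGEX REPLACE "\\.[^.]*$" "" _test "${_file_name}")   # ibm862.visual (up to last dot)
    add_test(NAME "${_lang}:${_test}"
             COMMAND uchardet-tests "${_test_file}" "${_charset}")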
… and rebuild of models.
The scores are really not bad now: 0.896026 for Norwegian and 0.877947
for Danish. It looks like the last confidence computation changes I made
are really bearing fruit!
I have had this test file locally for some time now, but it was always
failing, being recognized as other languages until now. Thanks to the
recent confidence improvements with the new frequent/rare ratios, it is
finally detected as English by uchardet!
It is currently recognized as Danish/UTF-8 with a 0.958 score, though
Norwegian/UTF-8 is indeed the second candidate with 0.911 (the third
candidate is far behind: Swedish/UTF-8 with 0.815). Before wasting time
tweaking models, there are more basic conceptual changes that I want to
implement first (they might be enough to change the results!). So let's
skip this test for now.
realpath() doesn't exist on Windows. Replace it with _fullpath(), which
does the same thing as far as I can see (at least for creating an
absolute path; it doesn't seem to canonicalize the path, or at least the
docs don't say so, yet since we control the arguments from our CMake
script, it's not a big problem anyway).
This fixes the Windows CI build, which failed with:
> undefined reference to `realpath'
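A minimal sketch of the shim (names and placement are illustrative; the
exact integration in the test code may differ):

    /* realpath() and _fullpath() take their arguments in opposite
     * order, and _fullpath() additionally wants the buffer size. */
    #include <stdlib.h>

    #ifdef _WIN32
    # define resolve_path(path, buffer) _fullpath((buffer), (path), _MAX_PATH)
    #else
    # define resolve_path(path, buffer) realpath((path), (buffer))
    #endif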
This prober comes from MR !1 on the main branch, though it was too
aggressive back then and could not get merged. On the improved API
branch, it doesn't detect other tests as Johab anymore.
Also fixing it to work with the new API.
Finally adding a Johab/ko unit test.
Taken from random pages for each of these languages.
I now have a test for each of the 26 supported (UTF-8, language)
couples. These all work very well and are detected with the right
encoding and language.
Newly added IBM865 charset (for Norwegian) can also be used for Danish
By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the
ISO 639-1 code 'da', not 'dk' (which is sometimes used in other codes
for Denmark, such as the ISO 3166 country code and the internet TLD, but
not for the language itself).
For the test, I added some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end: if there are
enough actual Maltese words, the stats should still weigh in favor of
Maltese (which they mostly did anyway). But since I am making some other
changes, this was just not enough. In particular, I was changing some of
the UTF-8 confidence logic and the file ended up detected as UTF-8 (even
though it contains illegal sequences and cannot be! Cf. #9).
So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have
definite, non-questionable Maltese in there to simplify testing at this
early stage of the uchardet rewrite.
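For context, a strict decoder rejects such a file outright; here is a
simplified sketch of the kind of structural check the UTF-8 prober
should enforce (illustrative only, not uchardet's actual state machine;
it doesn't reject overlong forms or surrogates like a full validator):

    // Returns false on the first malformed UTF-8 sequence.
    #include <string>

    bool is_plausible_utf8(const std::string &data)
    {
        size_t i = 0;
        while (i < data.size())
        {
            unsigned char c = data[i];
            size_t len = (c < 0x80) ? 1
                       : (c >> 5) == 0x6 ? 2   // 110xxxxx
                       : (c >> 4) == 0xE ? 3   // 1110xxxx
                       : (c >> 3) == 0x1E ? 4  // 11110xxx
                       : 0;                    // stray continuation, etc.
            if (len == 0 || i + len > data.size())
                return false;
            for (size_t j = 1; j < len; ++j)   // 10xxxxxx expected
                if ((static_cast<unsigned char>(data[i + j]) >> 6) != 0x2)
                    return false;
            i += len;
        }
        return true;
    }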
There is no need to do this per target; I use the global
CMAKE_CXX_STANDARD variable instead. Also, with
CMAKE_CXX_STANDARD_REQUIRED, I make this a strong requirement. The
documentation indeed states that CXX_STANDARD "is treated as optional
and may “decay” to a previous standard if the requested is not
available".
This means that uchardet will likely not be buildable with a compiler
that has no C++11 support. But I assume this is not a common situation,
and we probably should not care about outdated compilers. I remain open
to suggestions and disagreement on the topic, obviously.
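Concretely, the change boils down to something like:

    # Applies to all targets; CMAKE_CXX_STANDARD_REQUIRED makes the
    # standard a hard requirement instead of letting it decay.
    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)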
As discussed in bug 101032, it seems like the most common usage
nowadays. Let's make a specific choice to avoid different behavior on
different builds later on.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-15, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(jutumärgid in Estonian), which sit on codepoints only assigned in
ISO-8859-13, telling the two encodings apart.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were
added, since the other encodings end up similar to ISO-8859-1 for most
common texts (i.e. the glyphs used in Italian are on the same codepoints
in these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless, most texts in these encodings end up the same (same
codepoints for the Finnish glyphs), so I kept only tests for ISO-8859-1
and UTF-8. Models for the other encodings may still be useful when
processing texts with some symbols, etc.
Encodings are the same as for Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope), since the resources I found indicate the two
languages used the same encodings historically.
It is also worth noting that the test examples' encodings were already
properly detected through the Czech models, so the languages are
definitely very close, even statistically. Nevertheless, adding the
right models works better, and these get better scores. This will take
on its full meaning when uchardet is also used as a language detector
(in some not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
Just realized that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway, the models are now updated and can recognize texts
using these encodings for these languages.
Also added some test files, which work great.
I actually also added couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless, there are no differences in the main
characters used for Portuguese, so these encodings can hardly be told
apart and detection will usually return ISO-8859-1 only.