uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-08 01:36:41 +08:00

Author	SHA1	Message	Date
Jehan	7875272a8c	script, src, test: new Georgian support. For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-* sets were generated thanks to the new create-table.py script. Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.	2022-12-20 14:28:29 +01:00
Jehan	d40e5868d5	script, src, test: adding Catalan support. For UTF-8, ISO-8859-1 and WINDOWS-1252 support. The test for UTF-8 and ISO-8859-1 is taken from 'Marmota' page on Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the 'Unió_Europea' page. ISO-8859-1 and WINDOWS-1252 being very similar, regarding most letters (in particular the ones used in Catalan), I differentiated the test with a text containing the '€' symbol, which is on an unused spot in ISO-8859-1.	2022-12-20 01:46:15 +01:00
Jehan	0fe51d3851	Issue #21 : Greek CP737 support. It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more language and charset support is slowly starting to show the limitations of our legacy multi-byte charset support, since I haven't really touched it since the original Mozilla implementation. It might be time to start reviewing these parts of the code. The test file contents come from the 'Μαρμότα' page on Wikipedia in Greek (though since 2 letters are missing in this encoding, despite its popularity for Greek, I had to be careful to choose pieces of text without such letters).	2022-12-18 22:33:12 +01:00
Jehan	d6cab28fb4	README: missing UTF-8 support listed on several languages.	2022-12-17 23:00:26 +01:00
Jehan	abd123e07d	script, src, test: add Serbian support. For UTF-8, ISO-8859-5 and WINDOWS-1251. Test files' contents come from page 'Мрмот' on Wikipedia in Serbian.	2022-12-17 22:47:54 +01:00
Jehan	d00d4d52b7	src, script: add Macedonian support. For UTF-8, ISO-8859-5, WINDOWS-1251 and IBM855 encodings. Test files' contents come from page 'Хибернација' on Wikipedia in Macedonian.	2022-12-17 22:47:54 +01:00
Jehan	41d309e8a2	script, src: regenerate Russian models and add UTF-8/Russian support. This fixes the broken Russian test in Windows-1251 which once again gets a much better score with Russian. Also this adds UTF-8 support. Same as Bulgarian, I wonder why I had not regenerated this earlier. The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian. Note that now this broke the test zh:gb18030 (the score for KOI8-R / ru (0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a bit closer at our GB18030 dedicated prober.	2022-12-17 21:41:11 +01:00
Jehan	60dcec8a82	script, src, test: add Ukrainian support. UTF-8 and Windows-1251 support for now. This actually breaks the ru:windows-1251 test but, same as Bulgarian, I never generated Russian models with my scripts, so the models we currently use are quite outdated. It will obviously be a lot better once we have new Russian models. The test file contents come from the 'Бабак' page on Wikipedia in Ukrainian.	2022-12-17 21:40:56 +01:00
Jehan	0fffc109b5	script, src, test: adding Belarusian support. Support for UTF-8, Windows-1251 and ISO-8859-5. The test contents come from the 'Суркі' page on Wikipedia in Belarusian.	2022-12-17 19:13:03 +01:00
Jehan	ffb94e4a9d	script, src, test: Bulgarian language models added. Not sure why we had Bulgarian support but hadn't recently updated it (i.e. never with the model generation script, or so it seems), especially with generic language models, which allow UTF-8/Bulgarian support. Maybe I tested it some time ago and it was getting bad results? Anyway, now with all the recent updates to the confidence computation, I get very good detection scores. So adding support for UTF-8/Bulgarian and rebuilding other models too. Also adding a test for ISO-8859-5/Bulgarian (we already had support, but no test files). The 2 new test files are text from the 'Мармоти' page on Wikipedia in Bulgarian.	2022-12-17 18:41:00 +01:00
Jehan	0974920bdd	Issue #22 : Hebrew CP862 support. Added in both visual and logical order since Wikipedia says: "Hebrew text encoded using code page 862 was usually stored in visual order; nevertheless, a few DOS applications, notably a word processor named EinsteinWriter, stored Hebrew in logical order." I am not using the nsHebrewProber wrapper (nameProber) for this new support, because I am really unsure this is of any use. Our statistical code based on letter and sequence usage should be more than enough to detect both variants of Hebrew encoding already, and my testing shows that so far (with pretty outstanding scores on actual Hebrew tests while all the other probers return bad scores). This will have to be studied a bit more later and maybe the whole nsHebrewProber might be deleted, even for the Windows-1255 charset. I'm also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a bit by incrementing a single index, instead of maintaining the indexes by hand (otherwise each time we add probers in the middle, to keep them logically gathered by language, we have to manually increment dozens of following probers).	2022-12-16 23:27:52 +01:00
Jehan	c782177a8d	README: fix a duplicate.	2022-12-14 00:24:53 +01:00
Jehan	3ca49e2bc1	Update README.	2022-12-14 00:24:50 +01:00
Jehan	2f5c24006e	README, doc: some README and release procedure updates.	2022-12-08 22:34:22 +01:00
Jehan	ae6302a016	Release: version 0.0.8.	2022-12-08 21:52:25 +01:00
Jehan	c218a3ccd6	README: add a section about CMake exported targets. Since it's a new feature, we may as well write about it, even though I would personally not recommend this in favor of more standard and generic pkg-config (which is not dependent on which build system we are using ourselves).	2022-11-30 23:48:16 +01:00
Jehan	6196f86c46	README: update with newly added (lang, charset) couples.	2022-11-30 20:06:52 +01:00
myd7349	143b3fe513	README: update libchardet repository link	2022-08-01 19:38:19 +08:00
Aaron Madlon-Kay	6f38ab95f5	Mention MacPorts in readme	2021-01-27 06:57:58 +00:00
Jehan	c8a3572cca	Issue #17 : update README. Replace the old link to the science paper by one on archive-mozilla website. Remove the original source link as I can't find any archived version of it (even on archive.org, only the folder structure is saved, not actual files themselves, so it's useless). Also add some history, which is probably a nice touch. Add a link to crossroad to help people who'd want to cross-compile uchardet. Finally add the R binding by Artem Klevtsov and QtAV as reported.	2020-04-29 16:20:00 +02:00
Jehan	59f68dbe57	Release: version 0.0.7	2020-04-23 11:48:58 +02:00
Jehan	60bf53c81e	README: update to Gitlab links. Freedesktop moved its infrastructure to Gitlab a while ago.	2020-04-22 00:33:48 +02:00
Jehan	0cfb75724a	README: some small updates.	2020-04-22 00:17:23 +02:00
Jehan	bdfd6116a9	Add a mention about fd.o code of conduct.	2018-09-26 15:12:25 +02:00
Jehan	95872ef41c	Adding some information about building for Windows.	2017-12-26 03:37:42 +01:00
Jehan	056a5a6e51	README: add some applications having uchardet as dependency. There are likely more (and I know some are planning support) but these are the ones I know of and with support already in.	2017-09-21 00:06:03 +02:00
Jehan	d9d014742a	README: Gentoo also has a uchardet package. And it is up-to-date with upstream URL at Freedesktop! Good!	2017-05-28 21:13:59 +02:00
Jehan	d90d01bc9e	README: adding a flatpak-builder manifest example. Thanks to Sébastien Wilmet for the example.	2017-03-24 23:22:40 +01:00
Jehan	119fed7e8d	LangModels: add Swedish support. Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from https://sv.wikipedia.org/wiki/Mölle	2016-09-28 22:42:13 +02:00
Jehan	d62154bd6e	LangModels: add Slovene support. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and MAC-CENTRALEUROPE. Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet	2016-09-28 22:13:17 +02:00
Jehan	fbd2efdbe9	LangModels: Romanian support added. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852. Test texts from https://ro.wikipedia.org/wiki/Danemarca	2016-09-28 19:57:50 +02:00
Jehan	a7525b404d	LangModels: added support for Irish Gaelic. Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from: https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta	2016-09-27 00:49:05 +02:00
Jehan	a3a271dfd5	LangModels: Estonian models created. Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and Windows-1257. Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov Windows-1257 and ISO-8859-13 are very close so I added quotation marks (Jutumärgid) which are on codepoints only present in ISO-8859-13, which tells the two encodings apart.	2016-09-27 00:14:29 +02:00
Jehan	3c6d31f5c2	LangModels: new Croatian models. Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250 and MAC-CENTRALEUROPE. Test text from https://hr.wikipedia.org/wiki/Brekinja	2016-09-26 01:32:49 +02:00
Jehan	f262b1d65b	LangModels: add Italian support. Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Same as Finnish, only ISO-8859-1 and UTF-8 tests added, since the other encodings end up identical to ISO-8859-1 for most common texts (i.e. glyphs used in Italian are on the same codepoints in these other encodings). Test text from https://it.wikipedia.org/wiki/Architettura_longobarda	2016-09-21 18:52:09 +02:00
Jehan	87d0c16e0e	README: add Finnish support.	2016-09-21 18:35:26 +02:00
Jehan	ac4aa94b73	README: add Polish support and update "Mac-CentralEurope" into "MAC-CENTRALEUROPE" (as in iconv).	2016-09-21 17:38:22 +02:00
Jehan	f314b76c0a	README: add Slovak support.	2016-09-21 13:42:31 +02:00
Jehan	5680cba0b8	README: adding Czech and Maltese support information.	2016-09-21 03:45:40 +02:00
Jehan	d810f1175b	README: update Latvian and Lithuanian support. Uchardet now recognizes these langs also with ISO-8859-4 and ISO-8859-10.	2016-09-21 00:35:23 +02:00
Jehan	9f7ed67166	README: add info on Portuguese support.	2016-09-21 00:05:12 +02:00
Jehan	e98d257ec4	README: add ISO-8859-13 for Latvian and Lithuanian support.	2016-09-20 23:35:12 +02:00
Jehan	2a559e7b52	README, test: update README and rename EUC-KR test to UHC.	2016-09-19 01:44:32 +02:00
Jehan	8a8d6b654c	Release: version 0.0.6.	2016-07-20 01:47:50 +02:00
Jehan	771d78b7df	Update the URL links: uchardet is now a freedesktop project.	2016-07-20 01:47:50 +02:00
Jehan	20eb319359	README: make the licenses as a list. This was breaking as markdown by not creating linefeeds.	2016-07-20 00:21:07 +02:00
Jehan	602c1ab0fc	README, COPYING: adding links and text of licenses GPL 2.0 and LGPL 2.1. Thanks to Ilya Tumaykin for reporting the missing info.	2016-06-04 14:21:38 +02:00
Jehan	d5dba26e04	README: add Danish support for 3 charsets.	2016-02-19 19:11:56 +01:00
Jehan	1694999bce	README: update with VISCII support.	2016-02-13 03:52:06 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00

80 Commits