uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-06 16:56:40 +08:00

Author	SHA1	Message	Date
Jehan	0974920bdd	Issue #22 : Hebrew CP862 support. Added in both visual and logical order since Wikipedia says: "Hebrew text encoded using code page 862 was usually stored in visual order; nevertheless, a few DOS applications, notably a word processor named EinsteinWriter, stored Hebrew in logical order." I am not using the nsHebrewProber wrapper (nameProber) for this new support, because I am really unsure this is of any use. Our statistical code based on letter and sequence usage should be more than enough to detect both variants of Hebrew encoding already, and my testing shows that so far (with pretty outstanding scores on actual Hebrew tests while all the other probers return bad scores). This will have to be studied a bit more later and maybe the whole nsHebrewProber might be deleted, even for the Windows-1255 charset. I'm also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a bit by incrementing a single index, instead of maintaining the indexes by hand (otherwise each time we add probers in the middle, to keep them logically gathered by languages, we have to manually increment dozens of following probers).	2022-12-16 23:27:52 +01:00
Jehan	c782177a8d	README: fix a duplicate.	2022-12-14 00:24:53 +01:00
Jehan	3ca49e2bc1	Update README.	2022-12-14 00:24:50 +01:00
Jehan	2f5c24006e	README, doc: some README and release procedure updates.	2022-12-08 22:34:22 +01:00
Jehan	ae6302a016	Release: version 0.0.8.	2022-12-08 21:52:25 +01:00
Jehan	c218a3ccd6	README: add a section about CMake exported targets. Since it's a new feature, we may as well write about it, even though I would personally not recommend this in favor of more standard and generic pkg-config (which is not dependent on which build system we are using ourselves).	2022-11-30 23:48:16 +01:00
Jehan	6196f86c46	README: update with newly added (lang, charset) couples.	2022-11-30 20:06:52 +01:00
myd7349	143b3fe513	README: update libchardet repository link	2022-08-01 19:38:19 +08:00
Aaron Madlon-Kay	6f38ab95f5	Mention MacPorts in readme	2021-01-27 06:57:58 +00:00
Jehan	c8a3572cca	Issue #17 : update README. Replace the old link to the science paper by one on archive-mozilla website. Remove the original source link as I can't find any archived version of it (even on archive.org, only the folder structure is saved, not actual files themselves, so it's useless). Also add some history, which is probably a nice touch. Add a link to crossroad to help people who'd want to cross-compile uchardet. Finally add the R binding by Artem Klevtsov and QtAV as reported.	2020-04-29 16:20:00 +02:00
Jehan	59f68dbe57	Release: version 0.0.7	2020-04-23 11:48:58 +02:00
Jehan	60bf53c81e	README: update to Gitlab links. Freedesktop moved its infrastructure to Gitlab a while ago.	2020-04-22 00:33:48 +02:00
Jehan	0cfb75724a	README: some small updates.	2020-04-22 00:17:23 +02:00
Jehan	bdfd6116a9	Add a mention about fd.o code of conduct.	2018-09-26 15:12:25 +02:00
Jehan	95872ef41c	Adding some information about building for Windows.	2017-12-26 03:37:42 +01:00
Jehan	056a5a6e51	README: add some applications having uchardet as dependency. There are likely more (and I know some are planning support) but these are the ones I know of and with support already in.	2017-09-21 00:06:03 +02:00
Jehan	d9d014742a	README: Gentoo also has a uchardet package. And it is up-to-date with upstream URL at Freedesktop! Good!	2017-05-28 21:13:59 +02:00
Jehan	d90d01bc9e	README: adding a flatpak-builder manifest example. Thanks to Sébastien Wilmet for the example.	2017-03-24 23:22:40 +01:00
Jehan	119fed7e8d	LangModels: add Swedish support. Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from https://sv.wikipedia.org/wiki/Mölle	2016-09-28 22:42:13 +02:00
Jehan	d62154bd6e	LangModels: add Slovene support. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and MAC-CENTRALEUROPE. Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet	2016-09-28 22:13:17 +02:00
Jehan	fbd2efdbe9	LangModels: Romanian support added. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852. Test texts from https://ro.wikipedia.org/wiki/Danemarca	2016-09-28 19:57:50 +02:00
Jehan	a7525b404d	LangModels: added support for Irish Gaelic. Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from: https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta	2016-09-27 00:49:05 +02:00
Jehan	a3a271dfd5	LangModels: Estonian models created. Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and Windows-1257. Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov Windows-1257 and ISO-8859-13 are very close so I added quotation marks (Jutumärgid) which are on codepoints only present in ISO-8859-13, which sets the two encodings apart.	2016-09-27 00:14:29 +02:00
Jehan	3c6d31f5c2	LangModels: new Croatian models. Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250 and MAC-CENTRALEUROPE. Test text from https://hr.wikipedia.org/wiki/Brekinja	2016-09-26 01:32:49 +02:00
Jehan	f262b1d65b	LangModels: add Italian support. Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were added, since the other encodings end up the same as ISO-8859-1 for most common texts (i.e. the glyphs used in Italian are on the same codepoints in these other encodings). Test text from https://it.wikipedia.org/wiki/Architettura_longobarda	2016-09-21 18:52:09 +02:00
Jehan	87d0c16e0e	README: add Finnish support.	2016-09-21 18:35:26 +02:00
Jehan	ac4aa94b73	README: add Polish support and update "Mac-CentralEurope" into "MAC-CENTRALEUROPE" (as in iconv).	2016-09-21 17:38:22 +02:00
Jehan	f314b76c0a	README: add Slovak support.	2016-09-21 13:42:31 +02:00
Jehan	5680cba0b8	README: adding Czech and Maltese support information.	2016-09-21 03:45:40 +02:00
Jehan	d810f1175b	README: update Latvian and Lithuanian support. Uchardet now recognizes these langs also with ISO-8859-4 and ISO-8859-10.	2016-09-21 00:35:23 +02:00
Jehan	9f7ed67166	README: add info on Portuguese support.	2016-09-21 00:05:12 +02:00
Jehan	e98d257ec4	README: add ISO-8859-13 for Latvian and Lithuanian support.	2016-09-20 23:35:12 +02:00
Jehan	2a559e7b52	README, test: update README and rename EUC-KR test to UHC.	2016-09-19 01:44:32 +02:00
Jehan	8a8d6b654c	Release: version 0.0.6.	2016-07-20 01:47:50 +02:00
Jehan	771d78b7df	Update the URL links: uchardet is now a freedesktop project.	2016-07-20 01:47:50 +02:00
Jehan	20eb319359	README: make the licenses as a list. This was breaking as markdown by not creating linefeeds.	2016-07-20 00:21:07 +02:00
Jehan	602c1ab0fc	README, COPYING: adding links and text of licenses GPL 2.0 and LGPL 2.1. Thanks to Ilya Tumaykin for reporting the missing info.	2016-06-04 14:21:38 +02:00
Jehan	d5dba26e04	README: add Danish support for 3 charsets.	2016-02-19 19:11:56 +01:00
Jehan	1694999bce	README: update with VISCII support.	2016-02-13 03:52:06 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	0446e24c8d	README: uchardet now available on Fedora. Already in Fedora devel and soon to be added as update on Fedora 23, if I get it correctly. See: https://bugzilla.redhat.com/show_bug.cgi?id=1264713 https://admin.fedoraproject.org/pkgdb/package/rpms/uchardet/	2016-02-12 17:53:22 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	886e03a523	Release: version 0.0.5.	2015-12-04 22:45:26 +01:00
Jehan	2856e68aac	README: reorganize support list by alphabetic order. (Except for "International" and "Others")	2015-12-04 03:33:22 +01:00
Jehan	dc03ea002f	README: supports are per-language rather than per script system. In particular separate "Cyrillic" into "Russian" and "Bulgarian" (currently our only 2 supported languages using Cyrillic script).	2015-12-04 03:22:05 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is identical to TIS-620 except for the added non-breaking space character. Our detection will therefore always return TIS-620, except in the rare cases where a text has a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	b56a3c7b84	README: add German support.	2015-12-04 00:07:03 +01:00

70 Commits