uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-14 23:50:05 +08:00

Author	SHA1	Message	Date
Jehan	a3a271dfd5	LangModels: Estonian models created. Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and Windows-1257. Windows-1257 and ISO-8859-13 are very close, so I added quotation marks (Jutumärgid), which are on codepoints only present in ISO-8859-13, to tell the two encodings apart. Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov	2016-09-27 00:14:29 +02:00
Jehan	3c6d31f5c2	LangModels: new Croatian models. Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250 and MAC-CENTRALEUROPE. Test text from https://hr.wikipedia.org/wiki/Brekinja	2016-09-26 01:32:49 +02:00
Jehan	f262b1d65b	LangModels: add Italian support. Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were added, since the other encodings end up the same as ISO-8859-1 for most common texts (i.e. glyphs used in Italian are on the same codepoints in these other encodings). Test text from https://it.wikipedia.org/wiki/Architettura_longobarda	2016-09-21 18:52:09 +02:00
Jehan	87d0c16e0e	README: add Finnish support.	2016-09-21 18:35:26 +02:00
Jehan	ac4aa94b73	README: add Polish support and update "Mac-CentralEurope" to "MAC-CENTRALEUROPE" (as in iconv).	2016-09-21 17:38:22 +02:00
Jehan	f314b76c0a	README: add Slovak support.	2016-09-21 13:42:31 +02:00
Jehan	5680cba0b8	README: adding Czech and Maltese support information.	2016-09-21 03:45:40 +02:00
Jehan	d810f1175b	README: update Latvian and Lithuanian support. Uchardet now recognizes these langs also with ISO-8859-4 and ISO-8859-10.	2016-09-21 00:35:23 +02:00
Jehan	9f7ed67166	README: add info on Portuguese support.	2016-09-21 00:05:12 +02:00
Jehan	e98d257ec4	README: add ISO-8859-13 for Latvian and Lithuanian support.	2016-09-20 23:35:12 +02:00
Jehan	2a559e7b52	README, test: update README and rename EUC-KR test to UHC.	2016-09-19 01:44:32 +02:00
Jehan	8a8d6b654c	Release: version 0.0.6.	2016-07-20 01:47:50 +02:00
Jehan	771d78b7df	Update the URL links: uchardet is now a freedesktop project.	2016-07-20 01:47:50 +02:00
Jehan	20eb319359	README: format the licenses as a list. The previous layout rendered incorrectly as Markdown because it did not produce line breaks.	2016-07-20 00:21:07 +02:00
Jehan	602c1ab0fc	README, COPYING: adding links and text of licenses GPL 2.0 and LGPL 2.1. Thanks to Ilya Tumaykin for reporting the missing info.	2016-06-04 14:21:38 +02:00
Jehan	d5dba26e04	README: add Danish support for 3 charsets.	2016-02-19 19:11:56 +01:00
Jehan	1694999bce	README: update with VISCII support.	2016-02-13 03:52:06 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	0446e24c8d	README: uchardet now available on Fedora. Already in Fedora devel and soon to be added as update on Fedora 23, if I get it correctly. See: https://bugzilla.redhat.com/show_bug.cgi?id=1264713 https://admin.fedoraproject.org/pkgdb/package/rpms/uchardet/	2016-02-12 17:53:22 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	886e03a523	Release: version 0.0.5.	2015-12-04 22:45:26 +01:00
Jehan	2856e68aac	README: reorganize support list by alphabetic order. (Except for "International" and "Others")	2015-12-04 03:33:22 +01:00
Jehan	dc03ea002f	README: supports are per-language rather than per script system. In particular separate "Cyrillic" into "Russian" and "Bulgarian" (currently our only 2 supported languages using Cyrillic script).	2015-12-04 03:22:05 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is identical to TIS-620 except for the added non-breaking space character. Our detection will therefore return TIS-620 except in the rare cases where a text contains a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	b56a3c7b84	README: add German support.	2015-12-04 00:07:03 +01:00
Jehan	90728e4068	README: update with Windows-1252 support information.	2015-12-03 21:25:53 +01:00
Jehan	60f641bf37	Update README to mark independence from original Mozilla code.	2015-12-03 20:32:57 +01:00
Jehan	e4260f4a39	Release: version 0.0.4.	2015-12-03 19:48:58 +01:00
Jehan	ba56d91808	Update uchardet URL in various places.	2015-12-03 19:48:29 +01:00
Jehan	d1bc09e4d7	Update authors. I think I deserved being listed in the authors by now. ;-)	2015-12-03 19:44:13 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	92efc0b0b0	Update README: Unicode is "International".	2015-11-28 19:44:13 +01:00
Jehan	0289c2a232	Differentiate ASCII and detection failure. The lib used to return "" for both properly detected ASCII and detection failure. And the tool would return "ascii/unknown". Make a proper distinction between the 2 cases.	2015-11-28 17:04:52 +01:00
Jehan	4dbc6e7ab3	Update README with French support.	2015-11-28 02:20:57 +01:00
Jehan	b67370230b	Update README and manual to indicate that several files can be specified on the command line.	2015-11-27 18:27:11 +01:00
Jehan	c61e65aeb3	s/MACCYRILLIC/MAC-CYRILLIC/: write encoding names in README the same as what uchardet returns.	2015-11-27 18:19:02 +01:00
Jehan	d082704fec	Add Mageia command and specify Mint compatibility.	2015-11-23 17:46:01 +01:00
Jehan	ff5fd5eff9	Release: version 0.0.3.	2015-11-19 15:18:11 +01:00
Jehan	4db0d55692	URL of related project python-chardet has changed.	2015-11-17 21:40:44 +01:00
Jehan	9172b763d1	Add TIS-620 in README (Thai language) and a test file. Test text based on Thai Wikipedia page about the TIS-620 encoding: https://th.wikipedia.org/wiki/TIS-620	2015-11-17 17:39:45 +01:00
Jehan	399c4c4d9e	Add libchardet in related projects. See https://github.com/BYVoid/uchardet/issues/11 for review of differences with uchardet.	2015-11-17 17:12:44 +01:00
Jehan	dc371f3ba9	uchardet_get_charset() must return iconv-compatible names. It was not clear if our naming followed any kind of rules. In particular, iconv is a widely used encoding conversion API. We will follow its naming. At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW. Other names have been uppercased to follow naming from `iconv --list` though iconv is mostly case-insensitive so it should not have been a problem. "Just in case". Prober names can still have free naming (only used for output display apparently). Finally HZ-GB-2312 is absent from my iconv list, but I can still see this encoding in libiconv master code with this name. So I will consider it valid.	2015-11-17 16:15:21 +01:00
Jehan	d0ccdd5db9	Release: version 0.0.2.	2015-11-16 15:56:45 +01:00
Carbo Kuo	69b7133995	Add a link to rust-uchardet on README	2014-11-20 20:06:41 +01:00
Carbo Kuo	6caa8f6580	Add README	2013-11-08 07:02:50 +08:00

48 Commits
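
Commit fb3c47a073 above notes that ISO-8859-11 matches TIS-620 apart from the non-breaking space. The sketch below is one way to check that claim byte by byte. It is not part of uchardet: it uses Python's bundled `tis_620` and `iso8859_11` codecs and assumes they follow the same mapping tables as the encodings uchardet models. The helper names `decode_or_none` and `differing_bytes` are made up for this illustration.

```python
def decode_or_none(byte_value, codec):
    """Decode a single byte with the given codec, or return None if undefined."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_bytes(codec_a, codec_b):
    """List every byte value (0-255) that the two codecs decode differently."""
    return [
        b for b in range(256)
        if decode_or_none(b, codec_a) != decode_or_none(b, codec_b)
    ]


if __name__ == "__main__":
    # Print each byte where the two Thai codecs disagree, with both decodings.
    for b in differing_bytes("tis_620", "iso8859_11"):
        print(
            hex(b),
            repr(decode_or_none(b, "tis_620")),
            repr(decode_or_none(b, "iso8859_11")),
        )
```

If the two codecs differ only at the non-breaking space, a detector cannot tell them apart unless that byte appears in the text, which is consistent with the behaviour the commit message describes.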