70 Commits

Author SHA1 Message Date
Jehan
0974920bdd Issue #22: Hebrew CP862 support.
Added in both visual and logical order since Wikipedia says:

> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.

I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure it is of any use. Our statistical
code based on letter and sequence usage should already be more than
enough to detect both variants of Hebrew encoding, and my testing shows
this so far (with pretty outstanding scores on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later, and maybe the whole nsHebrewProber could be deleted, even
for the Windows-1255 charset.

I'm also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a
bit by incrementing a single index instead of maintaining the indexes by
hand (otherwise, each time we add probers in the middle to keep them
logically grouped by language, we have to manually renumber dozens of
following probers).
2022-12-16 23:27:52 +01:00
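
To illustrate why visual and logical order need separate models, here is a minimal Python sketch that encodes the same Hebrew word to CP862 both ways. It assumes that "visual order" can be approximated by a plain reversal, which only holds for a run made purely of Hebrew letters (real text with digits or Latin words needs a bidi algorithm). The helper name `to_visual` is hypothetical and not part of uchardet.

```python
# Sketch only: "visual order" is approximated by reversing the string,
# which is valid for a pure right-to-left run and nothing more.
LOGICAL = "\u05e9\u05dc\u05d5\u05dd"  # the word "shalom" in reading order


def to_visual(text: str) -> str:
    """Return the characters in left-to-right storage (display) order."""
    return text[::-1]


# Python ships a "cp862" codec, so both variants can be encoded directly.
logical_bytes = LOGICAL.encode("cp862")
visual_bytes = to_visual(LOGICAL).encode("cp862")

print("logical:", logical_bytes.hex(" "))
print("visual: ", visual_bytes.hex(" "))
```

The two byte strings contain the same letters, so letter frequencies are identical, but the letter sequences are mirrored. That is why sequence statistics can plausibly tell the two variants apart.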
Jehan
c782177a8d README: fix a duplicate. 2022-12-14 00:24:53 +01:00
Jehan
3ca49e2bc1 Update README. 2022-12-14 00:24:50 +01:00
Jehan
2f5c24006e README, doc: some README and release procedure updates. 2022-12-08 22:34:22 +01:00
Jehan
ae6302a016 Release: version 0.0.8. 2022-12-08 21:52:25 +01:00
Jehan
c218a3ccd6 README: add a section about CMake exported targets.
Since it's a new feature, we may as well write about it, even though I
would personally recommend the more standard and generic pkg-config
instead (which does not depend on which build system we are using
ourselves).
2022-11-30 23:48:16 +01:00
Jehan
6196f86c46 README: update with newly added (lang, charset) couples. 2022-11-30 20:06:52 +01:00
myd7349
143b3fe513 README: update libchardet repository link 2022-08-01 19:38:19 +08:00
Aaron Madlon-Kay
6f38ab95f5 Mention MacPorts in readme 2021-01-27 06:57:58 +00:00
Jehan
c8a3572cca Issue #17: update README.
Replace the old link to the scientific paper with one on the
archive-mozilla website. Remove the original source link as I can't find
any archived version of it (even on archive.org, only the folder
structure is saved, not the actual files themselves, so it's useless).

Also add some history, which is probably a nice touch.

Add a link to crossroad to help people who'd want to cross-compile
uchardet.

Finally add the R binding by Artem Klevtsov and QtAV as reported.
2020-04-29 16:20:00 +02:00
Jehan
59f68dbe57 Release: version 0.0.7 2020-04-23 11:48:58 +02:00
Jehan
60bf53c81e README: update to Gitlab links.
Freedesktop moved its infrastructure to Gitlab a while ago.
2020-04-22 00:33:48 +02:00
Jehan
0cfb75724a README: some small updates. 2020-04-22 00:17:23 +02:00
Jehan
bdfd6116a9 Add a mention about fd.o code of conduct. 2018-09-26 15:12:25 +02:00
Jehan
95872ef41c Adding some information about building for Windows. 2017-12-26 03:37:42 +01:00
Jehan
056a5a6e51 README: add some applications having uchardet as dependency.
There are likely more (and I know some are planning support) but these
are the ones I know of and with support already in.
2017-09-21 00:06:03 +02:00
Jehan
d9d014742a README: Gentoo also has a uchardet package.
And it is up-to-date with upstream URL at Freedesktop! Good!
2017-05-28 21:13:59 +02:00
Jehan
d90d01bc9e README: adding a flatpak-builder manifest example.
Thanks to Sébastien Wilmet for the example.
2017-03-24 23:22:40 +01:00
Jehan
119fed7e8d LangModels: add Swedish support.
Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252.
Test text from https://sv.wikipedia.org/wiki/Mölle
2016-09-28 22:42:13 +02:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00
Jehan
fbd2efdbe9 LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
2016-09-28 19:57:50 +02:00
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(Jutumärgid), which are on codepoints only present in ISO-8859-13,
to tell both encodings apart.
2016-09-27 00:14:29 +02:00
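
The point about quotation marks can be checked with a short Python sketch. It relies on Python's built-in `iso8859_13` and `cp1257` codecs; the sample letters and the helper name `encodings_agree` are my own choices for illustration, not taken from the test files.

```python
# Letters used in Estonian text (including the "š" of "Tšehhov").
LETTERS = "õäöüšž"
# Low opening quotation mark, as used in Estonian typography.
LOW_QUOTE = "\u201e"


def encodings_agree(text: str) -> bool:
    """True when ISO-8859-13 and Windows-1257 yield the same bytes."""
    return text.encode("iso8859_13") == text.encode("cp1257")


print("letters agree:", encodings_agree(LETTERS))
print("quote agrees: ", encodings_agree(LOW_QUOTE))
```

The letters land on the same bytes in both charsets, while the quotation mark does not, so a test text without such punctuation would not discriminate between the two.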
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were
added, since the other encodings end up identical to ISO-8859-1 for most
common texts (i.e. glyphs used in Italian are on the same codepoints in
these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
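
A quick way to see why extra test files would be redundant is to encode one Italian sentence with every listed charset and compare the bytes. This is a sketch using Python's built-in codecs; the sample sentence is an assumption of mine, not the actual test text.

```python
# Sample sentence covering the accented letters common in Italian.
SAMPLE = "Perché la città è più bella, però non è così."

CODECS = ["iso8859_1", "iso8859_3", "iso8859_9", "iso8859_15", "cp1252"]

# Map each codec name to the bytes it produces for the same text.
encoded = {name: SAMPLE.encode(name) for name in CODECS}

# Collect the distinct byte strings; a single entry means all codecs agree.
distinct = set(encoded.values())
print("distinct encodings:", len(distinct))
```

When every codec produces the same bytes, a detector has no information to prefer one charset over another for that text.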
Jehan
87d0c16e0e README: add Finnish support. 2016-09-21 18:35:26 +02:00
Jehan
ac4aa94b73 README: add Polish support…
… and update "Mac-CentralEurope" into "MAC-CENTRALEUROPE" (as in iconv).
2016-09-21 17:38:22 +02:00
Jehan
f314b76c0a README: add Slovak support. 2016-09-21 13:42:31 +02:00
Jehan
5680cba0b8 README: adding Czech and Maltese support information. 2016-09-21 03:45:40 +02:00
Jehan
d810f1175b README: update Latvian and Lithuanian support.
Uchardet now recognizes these langs also with ISO-8859-4 and
ISO-8859-10.
2016-09-21 00:35:23 +02:00
Jehan
9f7ed67166 README: add info on Portuguese support. 2016-09-21 00:05:12 +02:00
Jehan
e98d257ec4 README: add ISO-8859-13 for Latvian and Lithuanian support. 2016-09-20 23:35:12 +02:00
Jehan
2a559e7b52 README, test: update README and rename EUC-KR test to UHC. 2016-09-19 01:44:32 +02:00
Jehan
8a8d6b654c Release: version 0.0.6. 2016-07-20 01:47:50 +02:00
Jehan
771d78b7df Update the URL links: uchardet is now a freedesktop project. 2016-07-20 01:47:50 +02:00
Jehan
20eb319359 README: make the licenses as a list.
The Markdown rendering was broken because no line breaks were created.
2016-07-20 00:21:07 +02:00
Jehan
602c1ab0fc README, COPYING: adding links and text of licenses GPL 2.0 and LGPL 2.1.
Thanks to Ilya Tumaykin for reporting the missing info.
2016-06-04 14:21:38 +02:00
Jehan
d5dba26e04 README: add Danish support for 3 charsets. 2016-02-19 19:11:56 +01:00
Jehan
1694999bce README: update with VISCII support. 2016-02-13 03:52:06 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not have any support for it, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
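
The limitation described above can be probed from Python itself. The sketch below asks the codec registry whether a name resolves; `has_codec` is a hypothetical helper, and the result for VISCII reflects the standard CPython codec set as far as I know (a third-party package could register one).

```python
import codecs


def has_codec(name: str) -> bool:
    """Report whether Python's codec registry knows this charset name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for charset in ("cp1258", "viscii"):
    print(charset, "available:", has_codec(charset))
```

Without a codec, str.encode() raises LookupError for that charset, so byte-level statistics cannot be generated the usual way.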
Jehan
0446e24c8d README: uchardet now available on Fedora.
Already in Fedora devel and soon to be added as an update on Fedora 23,
if I understand correctly. See:
https://bugzilla.redhat.com/show_bug.cgi?id=1264713
https://admin.fedoraproject.org/pkgdb/package/rpms/uchardet/
2016-02-12 17:53:22 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
886e03a523 Release: version 0.0.5. 2015-12-04 22:45:26 +01:00
Jehan
2856e68aac README: reorganize support list by alphabetic order.
(Except for "International" and "Others")
2015-12-04 03:33:22 +01:00
Jehan
dc03ea002f README: supports are per-language rather than per script system.
In particular separate "Cyrillic" into "Russian" and "Bulgarian"
(currently our only 2 supported languages using Cyrillic script).
2015-12-04 03:22:05 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is essentially identical to TIS-620, with the added
non-breaking space character.
So our detection will always return TIS-620, except for the
exceptional cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
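
The relationship between the two Thai charsets can be sketched with Python's built-in `tis_620` and `iso8859_11` codecs. The sample text and the helper `can_encode` are illustrative assumptions; the behavior shown is that of Python's codec tables, which may not match every other implementation of TIS-620.

```python
THAI = "ภาษาไทย"  # "Thai language"
NBSP = "\u00a0"   # non-breaking space


def can_encode(text: str, codec: str) -> bool:
    """True when every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Ordinary Thai text produces the same bytes in both charsets.
same_bytes = THAI.encode("tis_620") == THAI.encode("iso8859_11")
print("Thai text identical:", same_bytes)

# Only the non-breaking space separates them.
for codec in ("tis_620", "iso8859_11"):
    print(codec, "has NBSP:", can_encode(NBSP, codec))
```

Since only one byte value differs, any text lacking a non-breaking space is equally valid in both, which matches the choice to report TIS-620 by default.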
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
b56a3c7b84 README: add German support. 2015-12-04 00:07:03 +01:00