80 Commits

Author SHA1 Message Date
Jehan
7875272a8c script, src, test: new Georgian support.
For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-*
sets were generated thanks to the new create-table.py script.

Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.
2022-12-20 14:28:29 +01:00
Jehan
d40e5868d5 script, src, test: adding Catalan support.
For UTF-8, ISO-8859-1 and WINDOWS-1252 support.

The test for UTF-8 and ISO-8859-1 is taken from 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. ISO-8859-1 and WINDOWS-1252 being very similar,
regarding most letters (in particular the ones used in Catalan), I
differentiated the test with a text containing the '€' symbol, which is
on an unused spot in ISO-8859-1.
2022-12-20 01:46:15 +01:00
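The ISO-8859-1 / WINDOWS-1252 distinction described in the commit above can be illustrated with Python's standard codecs. This is only a sketch, not code from the repository; `encodes` is a hypothetical helper introduced for the example.

```python
def encodes(text, charset):
    """Return the encoded bytes, or None if charset cannot represent text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


# Ordinary Catalan text encodes to identical bytes in both charsets.
sample = "Unió Europea"
same_bytes = encodes(sample, "iso-8859-1") == encodes(sample, "windows-1252")

# The euro sign exists only in WINDOWS-1252 (byte 0x80).
euro_latin1 = encodes("€", "iso-8859-1")
euro_cp1252 = encodes("€", "windows-1252")

print(same_bytes, euro_latin1, euro_cp1252)
```

A test text containing '€' therefore yields a byte that only makes sense as WINDOWS-1252 text.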
Jehan
0fe51d3851 Issue #21: Greek CP737 support.
It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, since I haven't really
touched it since the original implementation by Mozilla.

It might be time to start reviewing these parts of the code.

The test file contents come from the 'Μαρμότα' page on Wikipedia in Greek
(though since 2 letters are missing in this encoding, despite its
popularity for Greek, I had to be careful in choosing pieces of text
without such letters).
2022-12-18 22:33:12 +01:00
Jehan
d6cab28fb4 README: missing UTF-8 support listed on several languages. 2022-12-17 23:00:26 +01:00
Jehan
abd123e07d script, src, test: add Serbian support.
For UTF-8, ISO-8859-5 and WINDOWS-1251.

Test files' contents come from page 'Мрмот' on Wikipedia in Serbian.
2022-12-17 22:47:54 +01:00
Jehan
d00d4d52b7 src, script: add Macedonian support.
For UTF-8, ISO-8859-5, WINDOWS-1251 and IBM855 encodings.

Test files' contents come from page 'Хибернација' on Wikipedia in
Macedonian.
2022-12-17 22:47:54 +01:00
Jehan
41d309e8a2 script, src: regenerate Russian models and add UTF-8/Russian support.
This fixes the broken Russian test in Windows-1251 which once again gets
a much better score with Russian. Also this adds UTF-8 support.

Same as Bulgarian, I wonder why I had not regenerated this earlier.

The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.

Note that now this broke the test zh:gb18030 (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a
bit closer at our GB18030 dedicated prober.
2022-12-17 21:41:11 +01:00
Jehan
60dcec8a82 script, src, test: add Ukrainian support.
UTF-8 and Windows-1251 support for now.

This actually breaks ru:windows-1251 test but same as Bulgarian, I never
generated Russian models with my scripts, so the models we currently use
are quite outdated. It will obviously be a lot better once we have new
Russian models.

The test file contents come from the 'Бабак' page on Wikipedia in
Ukrainian.
2022-12-17 21:40:56 +01:00
Jehan
0fffc109b5 script, src, test: adding Belarusian support.
Support for UTF-8, Windows-1251 and ISO-8859-5.
The test contents come from the page 'Суркі' on Wikipedia in Belarusian.
2022-12-17 19:13:03 +01:00
Jehan
ffb94e4a9d script, src, test: Bulgarian language models added.
Not sure why we had Bulgarian support but haven't recently updated
it (i.e. never with the model generation script, or so it seems),
especially with generic language models, which allow
UTF-8/Bulgarian support. Maybe I tested it some time ago and it was
getting bad results? Anyway now with all the recent updates on the
confidence computation, I get very good detection scores.

So adding support for UTF-8/Bulgarian and rebuilding other models too.

Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).

The 2 new test files are text from page 'Мармоти' on Wikipedia in
Bulgarian language.
2022-12-17 18:41:00 +01:00
Jehan
0974920bdd Issue #22: Hebrew CP862 support.
Added in both visual and logical order since Wikipedia says:

> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.

I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure this is of any use. Our statistical
code based on letter and sequence usage should be more than enough to
detect both variants of Hebrew encoding already, and my testing shows
that so far (with pretty outstanding scores on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later and maybe the whole nsHebrewProber might be deleted, even
for Windows-1255 charset.

I'm also cleaning up nsSBCSGroupProber::nsSBCSGroupProber() a bit by
incrementing a single index, instead of maintaining the indexes by hand
(otherwise each time we add probers in the middle, to keep them
logically gathered by languages, we have to manually increment dozens of
following probers).
2022-12-16 23:27:52 +01:00
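The visual versus logical ordering mentioned in the commit above can be sketched with Python's `cp862` codec. This is a simplified illustration, not code from the repository: it assumes a single, purely right-to-left word, for which visual order is simply the logical order reversed. Real bidirectional text is more involved.

```python
# "shalom" written with escapes, in logical (reading) order:
# shin, lamed, vav, final mem.
logical = "\u05e9\u05dc\u05d5\u05dd"

# Simplifying assumption: for one pure right-to-left word, visual
# (left-to-right display) order is the reverse of logical order.
visual = logical[::-1]

logical_bytes = logical.encode("cp862")
visual_bytes = visual.encode("cp862")

print(logical_bytes.hex(), visual_bytes.hex())
```

Both byte strings use the same set of byte values and differ only in sequence, which is why letter-sequence statistics can tell the two variants apart.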
Jehan
c782177a8d README: fix a duplicate. 2022-12-14 00:24:53 +01:00
Jehan
3ca49e2bc1 Update README. 2022-12-14 00:24:50 +01:00
Jehan
2f5c24006e README, doc: some README and release procedure updates. 2022-12-08 22:34:22 +01:00
Jehan
ae6302a016 Release: version 0.0.8. 2022-12-08 21:52:25 +01:00
Jehan
c218a3ccd6 README: add a section about CMake exported targets.
Since it's a new feature, we may as well write about it, even though I
would personally not recommend it over the more standard and
generic pkg-config (which is not dependent on which build system we are
using ourselves).
2022-11-30 23:48:16 +01:00
Jehan
6196f86c46 README: update with newly added (lang, charset) couples. 2022-11-30 20:06:52 +01:00
myd7349
143b3fe513 README: update libchardet repository link 2022-08-01 19:38:19 +08:00
Aaron Madlon-Kay
6f38ab95f5 Mention MacPorts in readme 2021-01-27 06:57:58 +00:00
Jehan
c8a3572cca Issue #17: update README.
Replace the old link to the science paper by one on archive-mozilla
website. Remove the original source link as I can't find any archived
version of it (even on archive.org, only the folder structure is saved,
not actual files themselves, so it's useless).

Also add some history, which is probably a nice touch.

Add a link to crossroad to help people who'd want to cross-compile
uchardet.

Finally add the R binding by Artem Klevtsov and QtAV as reported.
2020-04-29 16:20:00 +02:00
Jehan
59f68dbe57 Release: version 0.0.7 2020-04-23 11:48:58 +02:00
Jehan
60bf53c81e README: update to Gitlab links.
Freedesktop moved its infrastructure to Gitlab a while ago.
2020-04-22 00:33:48 +02:00
Jehan
0cfb75724a README: some small updates. 2020-04-22 00:17:23 +02:00
Jehan
bdfd6116a9 Add a mention about fd.o code of conduct. 2018-09-26 15:12:25 +02:00
Jehan
95872ef41c Adding some information about building for Windows. 2017-12-26 03:37:42 +01:00
Jehan
056a5a6e51 README: add some applications having uchardet as dependency.
There are likely more (and I know some are planning support) but these
are the ones I know of and with support already in.
2017-09-21 00:06:03 +02:00
Jehan
d9d014742a README: Gentoo also has a uchardet package.
And it is up-to-date with upstream URL at Freedesktop! Good!
2017-05-28 21:13:59 +02:00
Jehan
d90d01bc9e README: adding a flatpak-builder manifest example.
Thanks to Sébastien Wilmet for the example.
2017-03-24 23:22:40 +01:00
Jehan
119fed7e8d LangModels: add Swedish support.
Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252.
Test text from https://sv.wikipedia.org/wiki/Mölle
2016-09-28 22:42:13 +02:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00
Jehan
fbd2efdbe9 LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
2016-09-28 19:57:50 +02:00
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-15, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close so I added quotation marks
(Jutumärgid) which are on codepoints only present in ISO-8859-13,
making it possible to tell both encodings apart.
2016-09-27 00:14:29 +02:00
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. Same as Finnish, only ISO-8859-1 and UTF-8 tests added
since the other encodings end up the same as ISO-8859-1 for most common
texts (i.e. glyphs used in Italian are on the same codepoints in these
other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
Jehan
87d0c16e0e README: add Finnish support. 2016-09-21 18:35:26 +02:00
Jehan
ac4aa94b73 README: add Polish support…
… and update "Mac-CentralEurope" into "MAC-CENTRALEUROPE" (as in iconv).
2016-09-21 17:38:22 +02:00
Jehan
f314b76c0a README: add Slovak support. 2016-09-21 13:42:31 +02:00
Jehan
5680cba0b8 README: adding Czech and Maltese support information. 2016-09-21 03:45:40 +02:00
Jehan
d810f1175b README: update Latvian and Lithuanian support.
Uchardet now recognizes these langs also with ISO-8859-4 and
ISO-8859-10.
2016-09-21 00:35:23 +02:00
Jehan
9f7ed67166 README: add info on Portuguese support. 2016-09-21 00:05:12 +02:00
Jehan
e98d257ec4 README: add ISO-8859-13 for Latvian and Lithuanian support. 2016-09-20 23:35:12 +02:00
Jehan
2a559e7b52 README, test: update README and rename EUC-KR test to UHC. 2016-09-19 01:44:32 +02:00
Jehan
8a8d6b654c Release: version 0.0.6. 2016-07-20 01:47:50 +02:00
Jehan
771d78b7df Update the URL links: uchardet is now a freedesktop project. 2016-07-20 01:47:50 +02:00
Jehan
20eb319359 README: make the licenses as a list.
This was breaking as markdown by not creating linefeeds.
2016-07-20 00:21:07 +02:00
Jehan
602c1ab0fc README, COPYING: adding links and text of licenses GPL 2.0 and LGPL 2.1.
Thanks to Ilya Tumaykin for reporting the missing info.
2016-06-04 14:21:38 +02:00
Jehan
d5dba26e04 README: add Danish support for 3 charsets. 2016-02-19 19:11:56 +01:00
Jehan
1694999bce README: update with VISCII support. 2016-02-13 03:52:06 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python's encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
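One way to check whether a given Python build can encode to a charset, as the commit above needed for VISCII, is to ask the codec registry. This is a sketch only; `has_codec` is a hypothetical helper, and whether 'viscii' resolves depends on the Python version and installed codecs.

```python
import codecs


def has_codec(name):
    """Return True if this Python build knows a codec called name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# Report availability for the two Vietnamese charsets named in the commit.
for charset in ("cp1258", "viscii"):
    print(charset, has_codec(charset))
```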