42 Commits

Author SHA1 Message Date
Jehan
60dcec8a82 script, src, test: add Ukrainian support.
UTF-8 and Windows-1251 support for now.

This actually breaks ru:windows-1251 test but same as Bulgarian, I never
generated Russian models with my scripts, so the models we currently use
are quite outdated. It will obviously be a lot better once we have new
Russian models.

The test file contents comes from 'Бабак' page on Wikipedia in
Ukrainian.
2022-12-17 21:40:56 +01:00
Jehan
0fffc109b5 script, src, test: adding Belarusian support.
Support for UTF-8, Windows-1251 and ISO-8859-5.
The test contents comes from page 'Суркі' on Wikipedia in Belarusian.
2022-12-17 19:13:03 +01:00
Jehan
0974920bdd Issue #22: Hebrew CP862 support.
Added in both visual and logical order since Wikipedia says:

> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.

I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure this is of any use. Our statistical
code based on letter and sequence usage should be more than enough to
detect both variants of Hebrew encoding already, and my testing show
that so far (with pretty outstanding score on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later and maybe the whole nsHebrewProber might be deleted, even
for Windows-1255 charset.

I'm also cleaning a bit nsSBCSGroupProber::nsSBCSGroupProber() code by
incrementing a single index, instead of maintaining the indexes by hand
(otherwise each time we add probers in the middle, to keep them
logically gathered by languages, we have to manually increment dozens of
following probers).
2022-12-16 23:27:52 +01:00
Jehan
b5b75b81ce script, src: rebuild the Danish model.
Now that it has IBM865 support on the main branch and that I rebased,
this feature branch for the new API got broken too.
2022-12-14 00:24:53 +01:00
Jehan
bfa4b10d4d script, src: add English language model.
English detection is still quite crappy so I don't add a unit test yet.
Though I believe the detection being bad is mostly because of too much
shortcutting we are doing to go "fast". I should probably review this
whole part of the logics as well.
2022-12-14 00:24:53 +01:00
Jehan
2127f4fc0d src: allow for nsCharSetProber to return several candidates.
No functional change yet because all probers still return 1 candidate.
Yet now we add a GetCandidates() method to return a number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
couple (charset, language) and that the confidence is for this specific
couple, not just the confidence for charset detection.
2022-12-14 00:23:13 +01:00
Jehan
5257fc1abf Using the generic language detector in UTF-8 detection.
Now the UTF-8 prober would not only detect valid UTF-8, but would also
detect the most probable language. Using the data generated 2 commits
away, this works very well.

This is still basic and will require even more improvements. In
particular, now the nsUTF8Prober should return an array of ("UTF-8",
language) couple candidate. And nsMBCSGroupProber should itself forward
these candidates as well as other candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.

Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid and low enough to give
chances to other good candidate couples.
2022-12-14 00:23:13 +01:00
Jehan
5a949265d5 src: new API to get the detected language.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.

Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.

I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
2022-12-14 00:23:13 +01:00
Jehan
388777be51 script, src, test: add IBM865 support for Danish.
Newly added IBM865 charset (for Norwegian) can also be used for Danish

By the way, I fixed `script/charsets/ibm865.py` as Danish uses the 'da'
ISO 639-1 code by the way, not 'dk' (which is sometimes used for other
codes for Denmark, such as ISO 3166 country code and internet TLD) but
not for the language itself.

For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
2022-11-30 19:57:52 +01:00
Martin T. H. Sandsmark
099a9a4fd6 Add norwegian support 2022-11-30 19:09:09 +01:00
Jehan
64efb1b24c Bug 101031 - Memory leak of nsSBCSGroupProber.
This manual incrementation code is just horrible and so error-prone.
Some day, we should make a cleaner loop to register all these
single-byte charset probers.
2017-05-14 18:24:11 +02:00
Jehan
119fed7e8d LangModels: add Swedish support.
Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252.
Test text from https://sv.wikipedia.org/wiki/Mölle
2016-09-28 22:42:13 +02:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00
Jehan
fbd2efdbe9 LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
2016-09-28 19:57:50 +02:00
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close so I added quotation marks
(Jutumärgid) which are on codepoints only present in ISO-8859-13,
making both encoding apart.
2016-09-27 00:14:29 +02:00
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
05ba8555cd src: fix number of Single-Byte charset probers. 2016-09-25 14:02:39 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added
since other encoding end up similar as ISO-8859-1 for most common texts
(i.e. glyphs used in Italian are on the same codepoints on these other
encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
Jehan
6bbe7da1ac LangModels: add Finnish support.
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encoding end up the same (same
codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1
and UTF-8. Models for other encoding may still be useful when processing
texts with some symbols, etc.
2016-09-21 18:27:39 +02:00
Jehan
3401ac70d0 LangModels: add Polish support.
With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16,
Windows-1250, IBM852, MAC-CENTRALEUROPE.
Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska
2016-09-21 17:30:15 +02:00
Jehan
5f9ec3aef0 LangModels: add support for Slovak.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope) since the resource I found indicate they used the
same encodings historically.
Also it is to be noted that the test examples' encoding were already
properly detected through Czech's models so the languages are definitely
very close, even statistically. Nevertheless adding the right models
will work better and these get better scores. This will take all its
meaning when uchardet will also be used as a language detector (in some
not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
2016-09-21 13:42:20 +02:00
Jehan
26e1cebad1 LangModels: add support for Czech.
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, which contents was made under the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
2016-09-21 03:33:50 +02:00
Jehan
2700cf3a83 LangModels: support for Maltese / ISO-8859-3.
Test text from https://mt.wikipedia.org/wiki/Franza.
2016-09-21 02:11:31 +02:00
Jehan
b7aebfdfda LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10.
Just realizing that these 2 language can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway now the models are updated and can recognize texts
using these encoding for these languages.
Added some test files as well, which work great.
2016-09-21 00:27:16 +02:00
Jehan
e138839f07 LangModels: add support for Portuguese / ISO-8859-1.
I actually added also couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless there are no differences on the main
characters related to Portuguese so differences will hardly be made
and detection will usually return ISO-8859-1 only.
2016-09-21 00:01:07 +02:00
Jehan
ea2f4dd40f LangModels: new support for Latvian / ISO-8859-13.
Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs
2016-09-20 23:29:53 +02:00
Jehan
7cb3dd9ddd LangModels: add support for Lithuanian / ISO-8859-13.
Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.
2016-09-20 23:09:24 +02:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
Test for ISO-8859-1 is disabled for now since the difference is not big
enough, as for characters used in Danish, between ISO-8859-1 and
ISO-8859-15. Therefore the first to be declared "wins".
Let's see to improve this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is basically exactly identical to TIS-620, with the added
non-breaking space character.
Basically our detection will always return TIS-620 except for
exceptional cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00