The Hebrew Model had never been regenerated by my scripts. I now added
the base generation files.
Note that I added 2 charsets: ISO-8859-8 and WINDOWS-1255 but they are
nearly identical. One of the difference is that the generic currency
sign is replaced by the sheqel sign (Israel currency) in Windows-1255.
And though this one lost the "double low line", apparently some Yiddish
characters were added. Basically it looks like most Hebrew text would
work fine with the same confidence on both charsets and detecting both
is likely irrelevant. So I keep the charset file for ISO-8859-8, but
won't actually use it.
The good part is now that Hebrew is also recognized in UTF-8 text thanks
to the new code and newly generated language model.
No functional change yet because all probers still return 1 candidate.
Yet now we add a GetCandidates() method to return a number of
candidates.
GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter
which is the candidate index (which must be below the return value of
GetCandidates()). We can now consider that nsCharSetProber computes a
couple (charset, language) and that the confidence is for this specific
couple, not just the confidence for charset detection.
Now the UTF-8 prober would not only detect valid UTF-8, but would also
detect the most probable language. Using the data generated 2 commits
away, this works very well.
This is still basic and will require even more improvements. In
particular, now the nsUTF8Prober should return an array of ("UTF-8",
language) couple candidate. And nsMBCSGroupProber should itself forward
these candidates as well as other candidates from other multi-byte
detectors. This way, the public-facing API would get more probable
candidates, in case the algorithm is slightly wrong.
Also the UTF-8 confidence is currently stupidly high as soon as we
consider it to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should
severely weigh down UTF-8 detection; not to 0, but high enough to be a
fallback in case no other encoding+lang is valid and low enough to give
chances to other good candidate couples.
This doesn't work for all probers yet, in particular not for the most
generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL.
It's still a good first step.
Right now, it returns the 2-character language code from ISO 639-1. A
using project could easily get the English language name from the
XML/json files provided by the iso-codes project. This project will also
allow to easily localize the language name in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though and leave it to downstream projects to implement this.
I was also wondering if we want to support region information for cases
when it would make sense. I especially wondered about it for Chinese
encodings as some of them seem quite specific to a region (according to
Wikipedia at least). For the time being though, these just return "zh".
We'll see later if it makes sense to be more accurate (maybe depending
on reports?).
This manual incrementation code is just horrible and so error-prone.
Some day, we should make a cleaner loop to register all these
single-byte charset probers.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close so I added quotation marks
(Jutumärgid) which are on codepoints only present in ISO-8859-13,
making both encoding apart.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added
since other encoding end up similar as ISO-8859-1 for most common texts
(i.e. glyphs used in Italian are on the same codepoints on these other
encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encoding end up the same (same
codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1
and UTF-8. Models for other encoding may still be useful when processing
texts with some symbols, etc.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope) since the resource I found indicate they used the
same encodings historically.
Also it is to be noted that the test examples' encoding were already
properly detected through Czech's models so the languages are definitely
very close, even statistically. Nevertheless adding the right models
will work better and these get better scores. This will take all its
meaning when uchardet will also be used as a language detector (in some
not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, which contents was made under the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
Just realizing that these 2 language can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway now the models are updated and can recognize texts
using these encoding for these languages.
Added some test files as well, which work great.
I actually added also couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless there are no differences on the main
characters related to Portuguese so differences will hardly be made
and detection will usually return ISO-8859-1 only.
I was planning on adding VISCII support as well, but Python encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the string.
ISO-8859-11 is basically exactly identical to TIS-620, with the added
non-breaking space character.
Basically our detection will always return TIS-620 except for
exceptional cases when a text has a non-breaking space.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.