For UTF-8, ISO-8859-1 and WINDOWS-1252 support.
The test for UTF-8 and ISO-8859-1 is taken from the 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. Since ISO-8859-1 and WINDOWS-1252 are very similar
regarding most letters (in particular the ones used in Catalan), I
differentiated the tests with a text containing the '€' symbol, which
sits on an unused spot in ISO-8859-1.
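For illustration, here is a minimal sketch (not uchardet's actual
prober logic; the helper name is made up) of why that byte range is so
discriminating:

    /* Bytes 0x80-0x9F are unassigned C1 controls in ISO-8859-1 but
     * mostly printable characters in Windows-1252 (0x80 being the euro
     * sign), so seeing them in otherwise Latin-1-looking text hints at
     * Windows-1252. */
    #include <string>

    bool hints_at_windows_1252(const std::string &text)
    {
        for (unsigned char c : text)
            if (c >= 0x80 && c <= 0x9F)
                return true;
        return false;
    }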
It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, since I haven't really touched
that code since the original Mozilla implementation.
It might be time to start reviewing these parts of the code.
The test file content comes from the 'Μαρμότα' page on Wikipedia in
Greek (though, since 2 letters are missing from this encoding despite
its popularity for Greek, I had to be careful to choose pieces of text
without those letters).
This fixes the broken Russian test in Windows-1251, which once again
gets a much better score for Russian. This also adds UTF-8 support.
Same as with Bulgarian, I wonder why I had not regenerated this earlier.
The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.
Note that this now breaks the zh:gb18030 test (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a
bit closer at our dedicated GB18030 prober.
UTF-8 and Windows-1251 support for now.
This actually breaks the ru:windows-1251 test but, as with Bulgarian, I
never generated the Russian models with my scripts, so the models we
currently use are quite outdated. Results will obviously be a lot better
once we have new Russian models.
The test file content comes from the 'Бабак' page on Wikipedia in
Ukrainian.
Not sure why we had Bulgarian support but never updated it (i.e.
apparently never with the model generation script), especially with
generic language models, which allow us to have UTF-8/Bulgarian support.
Maybe I tested it some time ago and it was getting bad results? Anyway,
with all the recent updates to the confidence computation, I now get
very good detection scores.
So this adds support for UTF-8/Bulgarian and rebuilds the other models
too. It also adds a test for ISO-8859-5/Bulgarian (we already had
support, but no test files).
The 2 new test files are text from the 'Мармоти' page on Wikipedia in
Bulgarian.
This is the same text, taken from this Wikipedia page, which was the
featured article of the day on Wikipedia in Hebrew:
https://he.wikipedia.org/wiki/שתי מסכתות על ממשל מדיני
I put it in 2 variants, since IBM862 can be used in logical and visual
variants. The visual variant simply reverses the order of the letters
within each line (while the lines themselves stay in proper order), so
that's what I did (see the sketch below).
Note that the English title quoted in the text should probably not have
been reversed, but it doesn't matter too much: those characters are
outside the Hebrew alphabet and would trigger a bad sequence score
whatever their order. So I didn't bother fixing them.
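For reference, a sketch of the per-line reversal described above (just
illustrative, not necessarily the exact tool that was used):

    // Reverse the characters of each line while keeping the line order.
    // Byte-wise reversal is only correct for single-byte encodings such
    // as IBM862; it would corrupt multi-byte text like UTF-8.
    #include <algorithm>
    #include <iostream>
    #include <string>

    int main()
    {
        std::string line;
        while (std::getline(std::cin, line))
        {
            std::reverse(line.begin(), line.end());
            std::cout << line << '\n';
        }
    }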
While the expected charset name is still the first part of the test
file name (up to the first dot character), the test name is now
everything but the last part (up to the last dot character). This allows
having several test files for a single charset.
In particular, I want at least 2 test files for Hebrew, since it has a
visual and a logical variant. So I can call these "ibm862.visual.txt"
and "ibm862.logical.txt", which both expect IBM862 as the resulting
charset, but whose test names will be "he:ibm862.visual" and
"he:ibm862.logical" respectively. Without this change, the test names
would collide and CMake would refuse them.
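Roughly, the naming logic looks like this (a sketch only; the variable
and runner names are illustrative, not necessarily what the actual
CMakeLists.txt uses):

    # For a test file "ibm862.visual.txt" in language directory "he":
    get_filename_component(_file_name "${_test_file}" NAME)      # ibm862.visual.txt
    string(REGEX REPLACE "\\..*$" "" _charset "${_file_name}")   # ibm862 (up to first dot)
    string(REGEX REPLACE "\\.[^.]*$" "" _test "${_file_name}")   # ibm862.visual (up to last dot)
    add_test(NAME "${_lang}:${_test}"
             COMMAND uchardet-tests "${_test_file}" "${_charset}")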
… and rebuild of models.
The scores are really not bad now: 0.896026 for Norwegian and 0.877947
for Danish. It looks like the last confidence computation changes I made
are really bearing fruit!
I have had this test file locally for some time now, but it was always
failing, being recognized as other languages until now. Thanks to the
recent confidence improvements with the new frequent/rare ratios, it is
finally detected as English by uchardet!
It is currently recognized as Danish/UTF-8 with a 0.958 score, though
Norwegian/UTF-8 is indeed the second candidate with 0.911 (the third
candidate is far behind: Swedish/UTF-8 with 0.815). Before wasting time
tweaking models, there are more basic conceptual changes that I want to
implement first (they might be enough to change the results!). So let's
skip this test for now.
realpath() doesn't exist on Windows. Replace it with _fullpath(), which
does the same thing as far as I can see (at least for creating an
absolute path; it doesn't seem to canonicalize the path, or at least the
docs don't say so, yet since we control the arguments from our CMake
script, it's not a big problem anyway).
This fixes the Windows CI build, which failed with:
> undefined reference to `realpath'
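A minimal sketch of the shim (names and placement are illustrative; the
exact integration in the test code may differ):

    /* realpath() and _fullpath() take their arguments in opposite
     * order, and _fullpath() additionally wants the buffer size. */
    #include <stdlib.h>

    #ifdef _WIN32
    # define resolve_path(path, buffer) _fullpath((buffer), (path), _MAX_PATH)
    #else
    # define resolve_path(path, buffer) realpath((path), (buffer))
    #endif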
This prober comes from MR !1 on the main branch, though it was too
aggressive back then and could not get merged. On the improved API
branch, it doesn't detect other tests as Johab anymore.
Also fixing it to work with the new API.
Finally adding a Johab/ko unit test.
Taken from random pages for each of these languages.
I now have a test for each of the 26 supported (UTF-8, language)
couples. These all work very well and are detected with the right
encoding and language.
Newly added IBM865 charset (for Norwegian) can also be used for Danish
By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the
ISO 639-1 code 'da', not 'dk' (which is sometimes used in other codes
for Denmark, such as the ISO 3166 country code and the internet TLD, but
not for the language itself).
For the test, I added some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Taken from the page: https://mt.wikipedia.org/wiki/Lingwa_Maltija
The old test was fine but had some French words in it, which lowered the
confidence for Maltese.
Technically it should not be a huge issue in the end: if there are
enough actual Maltese words, the stats should still weigh in favor of
Maltese (which they mostly did anyway). But since I am making some other
changes, this was just not enough. In particular, I was changing some of
the UTF-8 confidence logic and the file ended up detected as UTF-8 (even
though it contains illegal sequences and cannot be! Cf. #9).
So the real long-term solution is to actually fix our UTF-8 detector,
which I'll do at some point, but for the time being, let's have
definite, non-questionable Maltese in there to simplify testing at this
early stage of the uchardet rewrite.
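For context, a strict decoder rejects such a file outright; here is a
simplified sketch of the kind of structural check the UTF-8 prober
should enforce (illustrative only, not uchardet's actual state machine;
it doesn't reject overlong forms or surrogates like a full validator):

    // Returns false on the first malformed UTF-8 sequence.
    #include <string>

    bool is_plausible_utf8(const std::string &data)
    {
        size_t i = 0;
        while (i < data.size())
        {
            unsigned char c = data[i];
            size_t len = (c < 0x80) ? 1
                       : (c >> 5) == 0x6 ? 2   // 110xxxxx
                       : (c >> 4) == 0xE ? 3   // 1110xxxx
                       : (c >> 3) == 0x1E ? 4  // 11110xxx
                       : 0;                    // stray continuation, etc.
            if (len == 0 || i + len > data.size())
                return false;
            for (size_t j = 1; j < len; ++j)   // 10xxxxxx expected
                if ((static_cast<unsigned char>(data[i + j]) >> 6) != 0x2)
                    return false;
            i += len;
        }
        return true;
    }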
There is no need to do this per target; I use the global
CMAKE_CXX_STANDARD variable instead. Also, with
CMAKE_CXX_STANDARD_REQUIRED, I make this a strong requirement. The
documentation indeed states that CXX_STANDARD "is treated as optional
and may “decay” to a previous standard if the requested is not
available".
This means that uchardet will likely not be buildable with a compiler
that has no C++11 support. But I assume this is not a common situation,
and we probably should not care about outdated compilers. I remain open
to suggestions and disagreement on the topic, obviously.
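Concretely, the change boils down to something like:

    # Applies to all targets; CMAKE_CXX_STANDARD_REQUIRED makes the
    # standard a hard requirement instead of letting it decay.
    set(CMAKE_CXX_STANDARD 11)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)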
As discussed in bug 101032, it seems like the most common usage
nowadays. Let's make a specific choice to avoid different behavior on
different builds later on.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-15, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(jutumärgid in Estonian), which sit on codepoints only assigned in
ISO-8859-13, telling the two encodings apart.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were
added, since the other encodings end up similar to ISO-8859-1 for most
common texts (i.e. the glyphs used in Italian are on the same codepoints
in these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless, most texts in these encodings end up the same (same
codepoints for the Finnish glyphs), so I kept only tests for ISO-8859-1
and UTF-8. Models for the other encodings may still be useful when
processing texts with some symbols, etc.
Encodings are the same as for Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope), since the resources I found indicate the two
languages used the same encodings historically.
It is also worth noting that the test examples' encodings were already
properly detected through the Czech models, so the languages are
definitely very close, even statistically. Nevertheless, adding the
right models works better, and these get better scores. This will take
on its full meaning when uchardet is also used as a language detector
(in some not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
Just realized that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway, the models are now updated and can recognize texts
using these encodings for these languages.
Also added some test files, which work great.
I actually also added couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless, there are no differences in the main
characters used for Portuguese, so these encodings can hardly be told
apart and detection will usually return ISO-8859-1 only.