Encodings: ISO-8859-4, ISO-8859-13, Windows-1252 and Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close, so I added quotation marks
(Jutumärgid) to the test text. They are on codepoints only present in
ISO-8859-13, which tells the two encodings apart.
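
To illustrate why the quotation marks help, here is a minimal sketch using
Python's built-in codecs. The names QUOTES, LETTERS and encode_both are
made up for this example and are not part of the library. It encodes
common Estonian letters and the Estonian quotation marks in both encodings
so the resulting bytes can be compared.

```python
# Estonian quotation marks: U+201E, U+201C, U+201D.
QUOTES = "\u201e\u201c\u201d"
# A few non-ASCII letters that are common in Estonian text.
LETTERS = "õäöüšž"


def encode_both(text):
    """Return `text` encoded as (ISO-8859-13, Windows-1257)."""
    return text.encode("iso8859_13"), text.encode("cp1257")


if __name__ == "__main__":
    for label, sample in (("letters", LETTERS), ("quotes", QUOTES)):
        iso, win = encode_both(sample)
        verdict = "same bytes" if iso == win else "different bytes"
        print(label, iso.hex(), win.hex(), verdict)
```

The letters should encode to the same bytes in both encodings, while the
quotation marks should not, which is what makes them useful in a test text.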
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests were
added, since the other encodings end up identical to ISO-8859-1 for most
common texts (i.e. the glyphs used in Italian are on the same codepoints
in these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
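
The overlap can be checked with a short sketch, assuming Python's built-in
codecs are a fair stand-in for the encodings listed. ITALIAN_ACCENTS,
CODECS and encodings_agree are names invented for this example.

```python
# Accented letters commonly used in Italian.
ITALIAN_ACCENTS = "àèéìòùÀÈÉÌÒÙ"
# Python codec names for the encodings listed above.
CODECS = ("iso8859_1", "iso8859_3", "iso8859_9", "iso8859_15", "cp1252")


def encodings_agree(text, codecs=CODECS):
    """Return True if `text` encodes to the same bytes in every codec."""
    return len({text.encode(codec) for codec in codecs}) == 1


if __name__ == "__main__":
    print("Italian accents agree:", encodings_agree(ITALIAN_ACCENTS))
    # The euro sign is one of the characters where these encodings differ.
    print("Euro sign agrees:", encodings_agree("€", ("iso8859_15", "cp1252")))
```

When the bytes are identical in every encoding, a separate test file per
encoding would not exercise anything new.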
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate the proper
statistics data with the current version of the script.
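
A quick way to check whether a given Python installation knows a codec is
sketched below. has_codec is a helper name made up for this example. The
result for "viscii" depends on the Python version and on any extra codecs
registered at runtime.

```python
import codecs


def has_codec(name):
    """Return True if Python can look up a codec with this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    for name in ("utf-8", "iso8859_13", "viscii"):
        print(name, "available" if has_codec(name) else "missing")
```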
ISO-8859-11 is identical to TIS-620, except for the added non-breaking
space character.
As a result, our detection will always return TIS-620, except in the
rare cases where a text contains a non-breaking space.
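
The rule can be sketched as follows. This is only an illustration of the
tie-break between the two names, not the statistical detection the library
performs, and label_thai_charset is a hypothetical helper that assumes the
input is already known to be Thai in one of these two encodings.

```python
# Byte 0xA0 is the non-breaking space that ISO-8859-11 adds to TIS-620.
NBSP = b"\xa0"


def label_thai_charset(data):
    """Pick a name for Thai bytes: ISO-8859-11 only if NBSP is present."""
    return "ISO-8859-11" if NBSP in data else "TIS-620"


if __name__ == "__main__":
    plain = "ภาษาไทย".encode("iso8859_11")
    spaced = "ภาษา\u00a0ไทย".encode("iso8859_11")
    print(label_thai_charset(plain))
    print(label_thai_charset(spaced))
```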
The lib used to return "" both for properly detected ASCII and for
detection failure, and the tool would return "ascii/unknown".
Make a proper distinction between the two cases.
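
A minimal sketch of what the distinction allows on the caller's side. It
assumes, purely for illustration, that ASCII is now reported under its own
name and that "" is kept for failure only; display_name is a hypothetical
helper, not the tool's actual code.

```python
def display_name(charset):
    """Map a detection result to display text.

    Assumption for this sketch: an empty string means detection failed,
    and any non-empty string (including "ASCII") is a detected charset.
    """
    return charset if charset else "unknown"


if __name__ == "__main__":
    for result in ("ASCII", "UTF-8", ""):
        print(repr(result), "->", display_name(result))
```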
It was not clear whether our naming followed any rules. iconv is a
widely used encoding conversion API, so we will follow its naming.
At least one returned name was found to be invalid: x-euc-tw instead of
EUC-TW. Other names have been uppercased to follow the naming from
`iconv --list`, though iconv is mostly case-insensitive, so this should
not have been a problem. "Just in case".
Prober names can still be chosen freely (apparently they are only used
for output display).
Finally, HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding under this name in the libiconv master code, so I will
consider it valid.
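
The renaming rule can be sketched like this. RENAMES and iconv_style_name
are hypothetical names for this example, and the table only contains the
single invalid name mentioned above.

```python
# Names that need more than uppercasing to match iconv's naming.
# Only the case mentioned in this message is listed here.
RENAMES = {"x-euc-tw": "EUC-TW"}


def iconv_style_name(name):
    """Return `name` in iconv style: known renames first, else uppercase."""
    return RENAMES.get(name.lower(), name.upper())


if __name__ == "__main__":
    for name in ("x-euc-tw", "windows-1252", "HZ-GB-2312"):
        print(name, "->", iconv_style_name(name))
```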