20 Commits

Author SHA1 Message Date
Jehan
26e1cebad1 LangModels: add support for Czech.
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, which contents was made under the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
2016-09-21 03:33:50 +02:00
Jehan
2700cf3a83 LangModels: support for Maltese / ISO-8859-3.
Test text from https://mt.wikipedia.org/wiki/Franza.
2016-09-21 02:11:31 +02:00
Jehan
b7aebfdfda LangModels: add support for Latvian | Lithuanian / ISO-8859-4 | ISO-8859-10.
Just realizing that these 2 language can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway now the models are updated and can recognize texts
using these encoding for these languages.
Added some test files as well, which work great.
2016-09-21 00:27:16 +02:00
Jehan
e138839f07 LangModels: add support for Portuguese / ISO-8859-1.
I actually added also couples with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless there are no differences on the main
characters related to Portuguese so differences will hardly be made
and detection will usually return ISO-8859-1 only.
2016-09-21 00:01:07 +02:00
Jehan
ea2f4dd40f LangModels: new support for Latvian / ISO-8859-13.
Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs
2016-09-20 23:29:53 +02:00
Jehan
7cb3dd9ddd LangModels: add support for Lithuanian / ISO-8859-13.
Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.
2016-09-20 23:09:24 +02:00
Jehan
923d264470 LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15).
Test for ISO-8859-1 is disabled for now since the difference is not big
enough, as for characters used in Danish, between ISO-8859-1 and
ISO-8859-15. Therefore the first to be declared "wins".
Let's see to improve this later.
Test contents from:
https://da.wikipedia.org/wiki/Eurosymbol
https://da.wikipedia.org/wiki/Dansk_%28sprog%29
2016-02-19 19:10:41 +01:00
Jehan
98b5e52252 LangModels: add VISCII encoding support and retrain Vietnamese model. 2016-02-13 03:51:18 +01:00
Jehan
178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning on adding VISCII support as well, but Python encode()
method does not have any support for it apparently, so I cannot generate
the proper statistics data with the current version of the string.
2016-02-13 02:32:57 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00
Jehan
ffabb65712 LangModels: adding Spanish support.
With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.
2015-12-12 18:54:35 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is basically exactly identical to TIS-620, with the added
non-breaking space character.
Basically our detection will always return TIS-620 except for
exceptional cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00