For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-*
sets were generated thanks to the new create-table.py script.
Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.
For UTF-8, ISO-8859-1 and WINDOWS-1252 support.
The test for UTF-8 and ISO-8859-1 is taken from the 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. Since ISO-8859-1 and WINDOWS-1252 are very similar
for most letters (in particular the ones used in Catalan), I
differentiated the tests with a text containing the '€' symbol, which
sits on an unused spot in ISO-8859-1.
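To make the distinction concrete, here is a minimal check, assuming
Python's bundled codecs (the sample string is just a placeholder, not
the actual test text):

```python
# '€' (U+20AC) sits at byte 0x80 in Windows-1252 but cannot be
# represented in ISO-8859-1 at all, so its presence rules the latter out.
text = "pressupost en €"  # placeholder, not the real test text

print(text.encode("windows-1252"))   # works; '€' becomes b'\x80'
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("not representable in ISO-8859-1:", err)
```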
It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, which I haven't really touched
since the original Mozilla implementation.
It might be time to start reviewing these parts of the code.
The test file contents come from the 'Μαρμότα' page on Wikipedia in
Greek (though since two letters are missing in this encoding, despite
its popularity for Greek, I had to be careful to choose pieces of text
without those letters).
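For reference, a trivial way to check which characters of a candidate
text a charset cannot represent (a sketch, not part of the repo; the
sample text and encoding name below are placeholders):

```python
def unencodable(text, encoding):
    """Return the characters of text that the given codec cannot encode."""
    missing = set()
    for ch in set(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing

# An empty result means the snippet is safe to put in the test file.
print(unencodable("Η μαρμότα ζει στα βουνά.", "iso-8859-7"))
```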
This fixes the broken Russian test in Windows-1251, which once again
gets a much better score for Russian. This also adds UTF-8 support.
As with Bulgarian, I wonder why I had not regenerated this earlier.
The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.
Note that this now breaks the zh:gb18030 test (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a
bit closer at our dedicated GB18030 prober.
UTF-8 and Windows-1251 support for now.
This actually breaks the ru:windows-1251 test but, as with Bulgarian, I
never generated Russian models with my scripts, so the models we
currently use are quite outdated. It will obviously be a lot better once
we have new Russian models.
The test file contents come from the 'Бабак' page on Wikipedia in
Ukrainian.
Not sure why we had Bulgarian support but never updated it (i.e.
apparently never with the model generation script), especially with the
generic language models, which allow UTF-8/Bulgarian support. Maybe I
tested it some time ago and it was getting bad results? Anyway, with all
the recent updates to the confidence computation, I now get very good
detection scores.
So adding support for UTF-8/Bulgarian and rebuilding other models too.
Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).
The two new test files are text from the 'Мармоти' page of Wikipedia in
Bulgarian.
Added test files in both visual and logical order since Wikipedia says:
> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.
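To illustrate what that means in practice (a rough sketch that ignores
the full bidi algorithm; the word below is just an example): for a run
of pure Hebrew letters, visual order is simply the logical string
reversed before encoding.

```python
logical = "שלום"           # stored in reading order
visual = logical[::-1]     # stored in right-to-left display order

# Both variants are valid IBM862 byte streams, just in opposite order.
print(logical.encode("cp862"))
print(visual.encode("cp862"))
```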
I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure this is of any use. Our statistical
code based on letter and sequence usage should be more than enough to
detect both variants of Hebrew encoding already, and my testing shows
that so far (with pretty outstanding scores on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later, and maybe the whole nsHebrewProber might be deleted, even
for the Windows-1255 charset.
I'm also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a
bit by incrementing a single index instead of maintaining the indexes by
hand (otherwise, each time we add probers in the middle to keep them
logically grouped by language, we have to manually renumber dozens of
subsequent probers).
As I just rebased my branch about the new language detection API, I
needed to regenerate the Norwegian language models. Unfortunately it
doesn't detect UTF-8 Norwegian text, though it is not far off (Norwegian
comes out as the second candidate with a high 91% confidence, beaten by
Danish UTF-8 with 94% confidence, unfortunately!).
Note that I also updated the alphabet list for Norwegian as it contained
too many letters (according to Wikipedia at least), so even when training
a model, some of the listed characters were missing from the training
set.
English detection is still quite crappy so I'm not adding a unit test
yet. I believe the detection is bad mostly because of all the
shortcutting we do to go "fast". I should probably review this whole
part of the logic as well.
This allows handling cases where some characters are actually
alternatives/variants of another. For instance, the same word can be
written with either variant, while both are considered correct and
equivalent. Browsing the Slovenian Wikipedia a bit, it looks like they
only use them for titles there.
I use this for the first time on characters with diacritics in Slovene.
Indeed these are so rarely used that they would hardly show up in the
stats and, worse, any sequence using them in a tested text would likely
register as negative sequences and hence drop the confidence in
Slovenian. As a consequence, various Slovene texts would show up as
Slovak, since it is close enough and commonly uses the same characters
with diacritics.
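Roughly, the idea is to fold each rare variant onto its common
counterpart before counting, so it never produces unseen sequences. A
purely illustrative sketch (the table and function are hypothetical,
not the actual BuildLangModel.py code):

```python
# Hypothetical variant table: fold the rare letter onto the common one
# before computing character/sequence statistics (lowercase only here).
ALTERNATIVES = {"ć": "č"}

def fold_alternatives(text, alternatives=ALTERNATIVES):
    return "".join(alternatives.get(ch, ch) for ch in text)

print(fold_alternatives("sreća"))  # counted as if it were "sreča"
```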
The alphabet was not complete and thus confidence was a bit too low.
For instance the VISCII test case's confidence bumped from 0.643401 to
0.696346 and the UTF-8 test case bumped from 0.863777 to 0.99.
Only the Windows-1258 test case is slightly worse, going from 0.532846
to 0.532098. But the overall recognition gain is obvious anyway.
Until now, Korean charsets had their own probers as there is no
single-byte encoding for writing Korean. I now added a Korean model used
only for the generic character and sequence statistics.
I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language script has a lot of
characters (so we can't have a full sequence combination array, just too
much data). It's not perfect yet. For instance, our UTF-8 Korean test
file ends up with a confidence of 0.38503, which is low for obvious
Korean text. Still, it works (correctly detected, with the top confidence
compared to the others) and is a first step toward further improving
detection confidence.
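For the curious, the kind of trick involved is roughly this (a hedged
sketch of the idea, not the actual BuildLangModel.py code; the sample
string is a placeholder): only the most frequent characters get their
own slots in the sequence table, everything rarer shares one bucket.

```python
from collections import Counter

def sequence_counts(text, keep_top=512):
    # Only the keep_top most frequent characters keep their identity;
    # everything rarer shares a single "<other>" bucket.
    kept = {ch for ch, _ in Counter(text).most_common(keep_top)}

    def slot(ch):
        return ch if ch in kept else "<other>"

    return Counter((slot(a), slot(b)) for a, b in zip(text, text[1:]))

# Placeholder sample, not the actual training corpus.
print(sequence_counts("마멋은 다람쥐과에 속하는 포유류이다", keep_top=8).most_common(3))
```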
The Hebrew model had never been regenerated by my scripts. I now added
the base generation files.
Note that I added two charsets, ISO-8859-8 and WINDOWS-1255, but they are
nearly identical. One of the differences is that the generic currency
sign is replaced by the sheqel sign (Israel's currency) in Windows-1255.
And though the latter lost the "double low line", apparently some Yiddish
characters were added. Basically it looks like most Hebrew text would
work fine with the same confidence on both charsets, and detecting both
is likely irrelevant. So I keep the charset file for ISO-8859-8, but
won't actually use it.
The good part is that Hebrew is now also recognized in UTF-8 text thanks
to the new code and the newly generated language model.
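For the record, the closeness of the two charsets is easy to eyeball,
assuming Python's bundled codecs (this is just a sanity check, not repo
code):

```python
# List the high bytes that ISO-8859-8 and Windows-1255 decode differently.
diffs = []
for byte in range(0x80, 0x100):
    raw = bytes([byte])
    a = raw.decode("iso-8859-8", errors="replace")
    b = raw.decode("windows-1255", errors="replace")
    if a != b:
        diffs.append((hex(byte), a, b))

# The Hebrew letter block (0xE0-0xFA) comes out identical in both.
print(diffs)
```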
The newly added IBM865 charset (for Norwegian) can also be used for
Danish. By the way, I fixed `script/charsets/ibm865.py`: Danish uses the
ISO 639-1 code 'da', not 'dk' (which is used in other codes for Denmark,
such as the ISO 3166 country code and the internet TLD, but not for the
language itself).
For the test, adding some text from the top article of the day on the
Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. As with Finnish, only ISO-8859-1 and UTF-8 tests are
added, since the other encodings end up identical to ISO-8859-1 for most
common texts (i.e. the glyphs used in Italian are on the same codepoints
in these other encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encodings end up the same (same
codepoints for the Finnish glyphs), so I keep only tests for ISO-8859-1
and UTF-8. Models for the other encodings may still be useful when
processing texts with some symbols, etc.
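The "same bytes" claim is easy to verify, assuming Python's bundled
codecs (the sample is a placeholder, not the test file):

```python
sample = "Hyvää päivää, öljyä ja sähköä"  # placeholder Finnish text
encodings = ["iso8859-1", "iso8859-4", "iso8859-9",
             "iso8859-13", "iso8859-15", "windows-1252"]

reference = sample.encode(encodings[0])
# True: ä and ö (and plain ASCII) sit on the same codepoints everywhere.
print(all(sample.encode(enc) == reference for enc in encodings))
```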
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, whose contents were released into the Public Domain, is a
good reference for the encodings which were historically used for Czech
and Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
Just realized that these two languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway the models are now updated and can recognize texts using
these encodings for these languages.
Added some test files as well, which work great.
I actually also added language/charset couples for ISO-8859-9,
ISO-8859-15 and Windows-1252. Nevertheless there are no differences in
the main characters used for Portuguese, so these will hardly be told
apart and detection will usually return ISO-8859-1 only.
I did this to improve the model after a user reported a badly detected
Greek subtitle (see commit e0eec3b).
It didn't help, but well... since I updated it with much more data from
Wikipedia anyway, let's just commit it!
I was planning on adding VISCII support as well, but Python's encode()
method apparently has no support for it, so I cannot generate the proper
statistics data with the current version of the script.
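Concretely, the codec simply is not shipped with CPython, which is easy
to check:

```python
import codecs

try:
    codecs.lookup("viscii")
except LookupError as err:
    # CPython has no built-in VISCII codec, so the generation script
    # cannot rely on str.encode() for this charset.
    print(err)  # unknown encoding: viscii
```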
ISO-8859-11 is essentially identical to TIS-620, with only the
non-breaking space character added.
So our detection will almost always return TIS-620, except in the
exceptional case where a text contains a non-breaking space.
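The whole difference boils down to one byte, at least with CPython's
bundled codecs:

```python
nbsp = bytes([0xA0])

# ISO-8859-11 defines 0xA0 as the non-breaking space...
print(repr(nbsp.decode("iso8859-11")))   # '\xa0'

# ...while TIS-620 leaves that position undefined.
try:
    nbsp.decode("tis-620")
except UnicodeDecodeError as err:
    print(err)
```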