uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-08 01:36:41 +08:00

Author	SHA1	Message	Date
Jehan	60dcec8a82	script, src, test: add Ukrainian support. UTF-8 and Windows-1251 support for now. This actually breaks ru:windows-1251 test but same as Bulgarian, I never generated Russian models with my scripts, so the models we currently use are quite outdated. It will obviously be a lot better once we have new Russian models. The test file contents comes from 'Бабак' page on Wikipedia in Ukrainian.	2022-12-17 21:40:56 +01:00
Jehan	0fffc109b5	script, src, test: adding Belarusian support. Support for UTF-8, Windows-1251 and ISO-8859-5. The test contents comes from page 'Суркі' on Wikipedia in Belarusian.	2022-12-17 19:13:03 +01:00
Jehan	0974920bdd	Issue #22 : Hebrew CP862 support. Added in both visual and logical order since Wikipedia says: > Hebrew text encoded using code page 862 was usually stored in visual > order; nevertheless, a few DOS applications, notably a word processor > named EinsteinWriter, stored Hebrew in logical order. I am not using the nsHebrewProber wrapper (nameProber) for this new support, because I am really unsure this is of any use. Our statistical code based on letter and sequence usage should be more than enough to detect both variants of Hebrew encoding already, and my testing show that so far (with pretty outstanding score on actual Hebrew tests while all the other probers return bad scores). This will have to be studied a bit more later and maybe the whole nsHebrewProber might be deleted, even for Windows-1255 charset. I'm also cleaning a bit nsSBCSGroupProber::nsSBCSGroupProber() code by incrementing a single index, instead of maintaining the indexes by hand (otherwise each time we add probers in the middle, to keep them logically gathered by languages, we have to manually increment dozens of following probers).	2022-12-16 23:27:52 +01:00
Jehan	b5b75b81ce	script, src: rebuild the Danish model. Now that it has IBM865 support on the main branch and that I rebased, this feature branch for the new API got broken too.	2022-12-14 00:24:53 +01:00
Jehan	bfa4b10d4d	script, src: add English language model. English detection is still quite crappy so I don't add a unit test yet. Though I believe the detection being bad is mostly because of too much shortcutting we are doing to go "fast". I should probably review this whole part of the logics as well.	2022-12-14 00:24:53 +01:00
Jehan	2127f4fc0d	src: allow for nsCharSetProber to return several candidates. No functional change yet because all probers still return 1 candidate. Yet now we add a GetCandidates() method to return a number of candidates. GetCharSetName(), GetLanguage() and GetConfidence() now take a parameter which is the candidate index (which must be below the return value of GetCandidates()). We can now consider that nsCharSetProber computes a couple (charset, language) and that the confidence is for this specific couple, not just the confidence for charset detection.	2022-12-14 00:23:13 +01:00
Jehan	5257fc1abf	Using the generic language detector in UTF-8 detection. Now the UTF-8 prober would not only detect valid UTF-8, but would also detect the most probable language. Using the data generated 2 commits away, this works very well. This is still basic and will require even more improvements. In particular, now the nsUTF8Prober should return an array of ("UTF-8", language) couple candidate. And nsMBCSGroupProber should itself forward these candidates as well as other candidates from other multi-byte detectors. This way, the public-facing API would get more probable candidates, in case the algorithm is slightly wrong. Also the UTF-8 confidence is currently stupidly high as soon as we consider it to be right. We should likely weigh it with language detection (in particular, if no language is detected, this should severely weigh down UTF-8 detection; not to 0, but high enough to be a fallback in case no other encoding+lang is valid and low enough to give chances to other good candidate couples.	2022-12-14 00:23:13 +01:00
Jehan	5a949265d5	src: new API to get the detected language. This doesn't work for all probers yet, in particular not for the most generic probers (such as UTF-8) or WINDOWS-1252. These will return NULL. It's still a good first step. Right now, it returns the 2-character language code from ISO 639-1. A using project could easily get the English language name from the XML/json files provided by the iso-codes project. This project will also allow to easily localize the language name in other languages through gettext (this is what we do in GIMP for instance). I don't add any dependency though and leave it to downstream projects to implement this. I was also wondering if we want to support region information for cases when it would make sense. I especially wondered about it for Chinese encodings as some of them seem quite specific to a region (according to Wikipedia at least). For the time being though, these just return "zh". We'll see later if it makes sense to be more accurate (maybe depending on reports?).	2022-12-14 00:23:13 +01:00
Jehan	388777be51	script, src, test: add IBM865 support for Danish. Newly added IBM865 charset (for Norwegian) can also be used for Danish By the way, I fixed `script/charsets/ibm865.py` as Danish uses the 'da' ISO 639-1 code by the way, not 'dk' (which is sometimes used for other codes for Denmark, such as ISO 3166 country code and internet TLD) but not for the language itself. For the test, adding some text from the top article of the day on the Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)	2022-11-30 19:57:52 +01:00
Martin T. H. Sandsmark	099a9a4fd6	Add norwegian support	2022-11-30 19:09:09 +01:00
Jehan	64efb1b24c	Bug 101031 - Memory leak of nsSBCSGroupProber. This manual incrementation code is just horrible and so error-prone. Some day, we should make a cleaner loop to register all these single-byte charset probers.	2017-05-14 18:24:11 +02:00
Jehan	119fed7e8d	LangModels: add Swedish support. Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from https://sv.wikipedia.org/wiki/Mölle	2016-09-28 22:42:13 +02:00
Jehan	d62154bd6e	LangModels: add Slovene support. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and MAC-CENTRALEUROPE. Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet	2016-09-28 22:13:17 +02:00
Jehan	fbd2efdbe9	LangModels: Romanian support added. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852. Test texts from https://ro.wikipedia.org/wiki/Danemarca	2016-09-28 19:57:50 +02:00
Jehan	a7525b404d	LangModels: added support for Irish Gaelic. Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from: https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta	2016-09-27 00:49:05 +02:00
Jehan	a3a271dfd5	LangModels: Estonian models created. Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and Windows-1257. Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov Windows-1257 and ISO-8859-13 are very close so I added quotation marks (Jutumärgid) which are on codepoints only present in ISO-8859-13, making both encoding apart.	2016-09-27 00:14:29 +02:00
Jehan	3c6d31f5c2	LangModels: new Croatian models. Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250 and MAC-CENTRALEUROPE. Test text from https://hr.wikipedia.org/wiki/Brekinja	2016-09-26 01:32:49 +02:00
Jehan	05ba8555cd	src: fix number of Single-Byte charset probers.	2016-09-25 14:02:39 +02:00
Jehan	f262b1d65b	LangModels: add Italian support. Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added since other encoding end up similar as ISO-8859-1 for most common texts (i.e. glyphs used in Italian are on the same codepoints on these other encodings). Test text from https://it.wikipedia.org/wiki/Architettura_longobarda	2016-09-21 18:52:09 +02:00
Jehan	6bbe7da1ac	LangModels: add Finnish support. I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13, ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters. Nevertheless most texts in these encoding end up the same (same codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1 and UTF-8. Models for other encoding may still be useful when processing texts with some symbols, etc.	2016-09-21 18:27:39 +02:00
Jehan	3401ac70d0	LangModels: add Polish support. With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16, Windows-1250, IBM852, MAC-CENTRALEUROPE. Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska	2016-09-21 17:30:15 +02:00
Jehan	5f9ec3aef0	LangModels: add support for Slovak. Encodings are the same as Czech (Windows-1250, ISO-8859-2 and Mac-CentralEurope) since the resource I found indicate they used the same encodings historically. Also it is to be noted that the test examples' encoding were already properly detected through Czech's models so the languages are definitely very close, even statistically. Nevertheless adding the right models will work better and these get better scores. This will take all its meaning when uchardet will also be used as a language detector (in some not-too-far future, hopefully!). Test text taken from: https://sk.wikipedia.org/wiki/Jupiter	2016-09-21 13:42:20 +02:00
Jehan	26e1cebad1	LangModels: add support for Czech. Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope. Other encodings are known to have been used for Czech: Kamenicky, KOI-8 CS2 and Cork. But these are uncommon enough that I decided not to support them (especially since I can't find them supported in iconv either, or at least not under an alias which I could recognize). This web page, which contents was made under the Public Domain, is a good reference for encodings which were used historically for Czech and Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html	2016-09-21 03:33:50 +02:00
Jehan	2700cf3a83	LangModels: support for Maltese / ISO-8859-3. Test text from https://mt.wikipedia.org/wiki/Franza.	2016-09-21 02:11:31 +02:00
Jehan	b7aebfdfda	LangModels: add support for Latvian \| Lithuanian / ISO-8859-4 \| ISO-8859-10. Just realizing that these 2 language can also be encoded with these charsets (even though ISO-8859-13 would appear to be more common… maybe?). Anyway now the models are updated and can recognize texts using these encoding for these languages. Added some test files as well, which work great.	2016-09-21 00:27:16 +02:00
Jehan	e138839f07	LangModels: add support for Portuguese / ISO-8859-1. I actually added also couples with ISO-8859-9, ISO-8859-15 and Windows-1252. Nevertheless there are no differences on the main characters related to Portuguese so differences will hardly be made and detection will usually return ISO-8859-1 only.	2016-09-21 00:01:07 +02:00
Jehan	ea2f4dd40f	LangModels: new support for Latvian / ISO-8859-13. Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs	2016-09-20 23:29:53 +02:00
Jehan	7cb3dd9ddd	LangModels: add support for Lithuanian / ISO-8859-13. Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.	2016-09-20 23:09:24 +02:00
Jehan	923d264470	LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). Test for ISO-8859-1 is disabled for now since the difference is not big enough, as for characters used in Danish, between ISO-8859-1 and ISO-8859-15. Therefore the first to be declared "wins". Let's see to improve this later. Test contents from: https://da.wikipedia.org/wiki/Eurosymbol https://da.wikipedia.org/wiki/Dansk_%28sprog%29	2016-02-19 19:10:41 +01:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is basically exactly identical to TIS-620, with the added non-breaking space character. Basically our detection will always return TIS-620 except for exceptional cases when a text has a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	aa587a64bd	LangModels: adding German models for ISO-8859-1 and Windows-1252.	2015-12-03 23:58:41 +01:00
Jehan	0270b1e856	Adding French Windows-1252 support.	2015-12-03 21:22:30 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	005fd98086	Add initial support for French with ISO-8859-1 and ISO-8859-15. Mostly generated with a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generating script to the main tree.	2015-11-28 02:14:39 +01:00
BYVoid	84284eccf4	Update code from upstream.	2011-07-11 14:42:50 +08:00
BYVoid	3601900164	Initial release.	2011-07-10 15:04:42 +08:00

42 Commits