Adding `auto_suggest=False` to the wikipedia.page() call because the
auto-suggest feature is completely broken, searching for "mar ot"
instead of "marmot" or "ground hug" instead of "Groundhog" (this one is
extra funny but not so useful!). I actually wonder why it even needs to
suggest anything when the Wikipedia pages do actually exist! Anyway, the
script BuildLangModel.py was very broken because of this; now it's better.
See: https://github.com/goldsmith/Wikipedia/issues/295
Also printing the error message when we discard a page, which helps
debugging.
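A minimal sketch of what the fetching code might look like after this
change (the `get_page` helper is illustrative, not the actual script
code):

```python
import wikipedia

def get_page(title):
    try:
        # auto_suggest=False: trust the title we were given instead of
        # letting the library "correct" it to an unrelated page.
        return wikipedia.page(title, auto_suggest=False)
    except wikipedia.exceptions.WikipediaException as error:
        # Print why the page is discarded; this helps debugging.
        print('Discarding page "{}": {}'.format(title, error))
        return None
```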
The early version used to stop earlier, assuming frequent ranges were
used only for language scripts with a lot of characters (such as Korean,
or even more so Japanese or Chinese), hence it was not efficient to keep
data for them all. Since we now use a separate language detector for
CJK, the remaining scripts (so far) have a usable range of characters.
Therefore it is much preferred to keep as much data as possible on these.
This allowed redoing the Thai model (cf. previous commit) with more
data, hence getting much better language confidence on Thai texts.
This allows handling cases where some characters are actually
alternatives/variants of another. For instance, the same word can be
written with both variants, while both are considered correct and
equivalent. Browsing the Slovenian Wikipedia a bit, it looks like they
only use them for titles there.
I use this for the first time on characters with diacritics in Slovene.
Indeed these are so rarely used that they would hardly show in the stats
and, worse, any sequence using them in a tested text would likely show
as negative sequences, hence dropping the confidence in Slovenian. As a
consequence, various Slovene texts would show up as Slovak, as it is
close enough and commonly uses the same characters with diacritics.
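As an illustration, the language description could declare such variants
with a mapping from rare variant characters to their base equivalent (the
`alphabet_mapping` name and the example characters are assumptions for
this sketch, not necessarily the real property):

```python
# Hypothetical excerpt from a language file (e.g. langs/sl.py): both
# spellings count as the same character in the statistics.
alphabet_mapping = {
    'à': 'a',  # assumed example: accented vowels as rare variants
    'é': 'e',
    'ì': 'i',
}
```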
In particular, I prepare the ground for English detection. I am not
pushing actual English models yet, because detection is not efficient
enough yet. I will do so once I can handle English confidence better.
Until now, Korean charsets had their own probers, as there is no
single-byte encoding for writing Korean. I now added a Korean model only
for the generic character and sequence statistics.
I also improved the generation script (script/BuildLangModel.py) to
allow for languages without single-byte charset generation and to
provide meaningful statistics even when the language script has a lot of
characters (so we can't have a full sequence combination array; that
would just be too much data). It's not perfect yet. For instance, our
UTF-8 Korean test file ends up with a confidence of 0.38503, which is
low for obvious Korean text. Still, it works (correctly detected, with
top confidence compared to the others) and is a first step toward
improving detection confidence further.
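A rough sketch of the idea, with hypothetical helper names: only the
most frequent characters get an entry in the sequence array, so its size
stays bounded even for scripts with thousands of characters.

```python
MAX_SEQUENCE_CHARS = 512  # assumed cap, for the sketch only

def build_stats(char_counts, sequence_counts):
    # Order characters by decreasing frequency.
    ordered = sorted(char_counts, key=char_counts.get, reverse=True)
    # Keep full sequence statistics only for the most frequent ones; a
    # full array over the whole script would be far too much data.
    kept = ordered[:MAX_SEQUENCE_CHARS]
    order = {char: rank for rank, char in enumerate(kept)}
    matrix = [[0] * len(kept) for _ in kept]
    for (first, second), count in sequence_counts.items():
        if first in order and second in order:
            matrix[order[first]][order[second]] = count
    return kept, matrix
```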
Adding a generic language model (see coming commit), which uses the same
data as the specific single-byte encoding statistics models, except that
it applies them to Unicode code points.
For this to work, instead of the CharToOrderMap, which was mapping
directly from encoded bytes (always 256 values) to orders, we now add an
array of frequent characters, sorted by Unicode code point, mapping each
code point to its frequency order (which can then be used with the same
sequence mapping array).
This of course means that each prober where we want to use these generic
models will have to implement its own byte-to-code-point decoder, as
this is per-encoding logic anyway. This will come in a subsequent
commit.
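To illustrate the structure (all names here are assumptions for the
sketch): since the array is sorted by code point, a binary search
returns the frequency order for a given code point, and that order
indexes the same sequence mapping array as before.

```python
import bisect

# Hypothetical frequent-character table: (code point, frequency order)
# pairs sorted by code point.
FREQUENT_CHARS = [
    (0x0061, 2),   # 'a'
    (0x0065, 0),   # 'e'
    (0x0074, 1),   # 't'
]
CODE_POINTS = [cp for cp, _ in FREQUENT_CHARS]

def code_point_to_order(code_point):
    # Binary search in the code-point-sorted array.
    index = bisect.bisect_left(CODE_POINTS, code_point)
    if index < len(FREQUENT_CHARS) and CODE_POINTS[index] == code_point:
        return FREQUENT_CHARS[index][1]
    return None  # not a frequent character
```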
This happened when building a Croatian model, which can be written with
many different encodings. There were also many irrelevant glyphs (i.e.
used in other languages) in these encodings, so we ended up with orders
over 255, which breaks when converting to unsigned char.
Let's just make sure that we don't cross the 250 limit (values above are
used for controls, illegal characters, symbols, numbers…). This means we
may have several characters with order 249, but since orders beyond the
frequent character list don't matter, this is not a problem.
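In the generation script, this could boil down to a simple clamp when
assigning orders (a sketch; the names are assumptions):

```python
MAX_ORDER = 249  # 250+ is reserved for controls, illegal characters,
                 # symbols, numbers…

def clamp_order(order):
    # Several rare characters may end up sharing order 249; that's fine,
    # since orders beyond the frequent character list are never used.
    return min(order, MAX_ORDER)
```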
Even the test `if hasattr(page, 'links')` would trigger this exception,
so I try the "Easier to Ask Forgiveness than Permission" approach. Weird
stuff, but well…
Note: I had this exception when running it on the Maltese data.
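A minimal sketch of the EAFP pattern used here (the helper and its
handling are illustrative):

```python
def get_links(page):
    try:
        # Even hasattr(page, 'links') raises here, so don't check
        # first: access the attribute and catch whatever comes out.
        return page.links
    except Exception as error:
        print('Skipping links of "{}": {}'.format(page, error))
        return []
```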
… for langs for which Python's lower() algorithm fails.
In particular, the Turkish dotted/dotless 'i' does not follow the same
rules as common Western languages.
The lowercase of 'I' is indeed not 'i' but 'ı'.
The uppercase of 'i' is indeed not 'I' but 'İ'.
I had a case with the Turkish dotted 'İ' where lowercasing it with the
Python algorithm returned a decomposed character that it was not able
to recompose. Therefore ord() raised a TypeError because the string
length was 2.
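A sketch of what such a custom case mapping could look like for Turkish
(the `custom_case_mapping` name and the helper are assumptions):

```python
# Hypothetical language property: overrides applied before falling back
# to Python's lower().
custom_case_mapping = {
    'I': 'ı',  # dotless lowercase
    'İ': 'i',  # dotted uppercase lowers to plain 'i'
}

def lower_char(char):
    if char in custom_case_mapping:
        return custom_case_mapping[char]
    lowered = char.lower()
    # Python can return a decomposed 2-code-point string (e.g. for 'İ'),
    # which would make ord() raise a TypeError; keep the original then.
    return lowered if len(lowered) == 1 else char
```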
I.e. horizontally, or "breadth first", rather than a vertical tree
traversal. This makes sure that all the start pages in particular are
visited when using the max_page option.
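A minimal sketch of such a breadth-first crawl, assuming hypothetical
`start_pages`, `max_page` and `process_page` names:

```python
from collections import deque

def crawl(start_pages, max_page, process_page):
    # Breadth first: all start pages are processed before any of the
    # pages they link to, so none of them is starved by max_page.
    queue = deque(start_pages)
    visited = set()
    while queue and len(visited) < max_page:
        title = queue.popleft()
        if title in visited:
            continue
        visited.add(title)
        for link in process_page(title):
            queue.append(link)
```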
With the new case_mapping lang property, we can consider the upper and
lower case versions of the same character as one character.
This makes sense in some languages, and would allow entering some rarer
characters (but still in the main alphabet) into the frequent character
list. For instance, 'œ' and 'Œ' in French.
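For illustration, the frequency counting could then fold case before
counting (a sketch; `case_mapping` per this commit, the rest assumed):

```python
def count_frequencies(text, case_mapping):
    frequencies = {}
    for char in text:
        if case_mapping:
            # 'Œ' and 'œ' are counted as one character, which can push
            # rarer alphabet characters into the frequent list.
            char = char.lower()
        frequencies[char] = frequencies.get(char, 0) + 1
    return frequencies
```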