uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-05 17:30:09 +08:00

Author	SHA1	Message	Date
Jehan	210e52d99a	LangModels: update the Greek language models. I did this to improve the model after a user reported a Greek subtitle being badly detected (see commit e0eec3b). It didn't help, but since I updated it with much more data from Wikipedia, let's just commit it!	2016-05-25 17:39:10 +02:00
Jehan	6cd8c322ad	script: fix a stupid bug in BuildLangModel.py.	2016-05-25 15:23:36 +02:00
Jehan	198190461e	script: move the Wikipedia title syntax cleaning to BuildLangModel.py.	2016-02-21 16:20:22 +01:00
Jehan	d24bd7d578	script: Wikipedia API's python wrapper does not return garbage text anymore. I can't see new commits since 2014. So I am assuming the issue was on Wikipedia side and that it has been fixed.	2016-02-21 16:07:10 +01:00
Jehan	37024460fe	script: add a README file dedicated to adding new support.	2016-02-21 16:06:11 +01:00
Jehan	923d264470	LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). Test for ISO-8859-1 is disabled for now since, for the characters used in Danish, the difference between ISO-8859-1 and ISO-8859-15 is not big enough. Therefore the first to be declared "wins". Let's look into improving this later. Test contents from: https://da.wikipedia.org/wiki/Eurosymbol https://da.wikipedia.org/wiki/Dansk_%28sprog%29	2016-02-19 19:10:41 +01:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	600cf76a76	BuildLangModel: try using iconv for conversion when support is missing in Python. For instance, I had a case where the VISCII encoding is supported by iconv but not by the encode()/decode() functions in core Python.	2016-02-13 03:47:41 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	27135a8880	BuildLangModel: printing a message when discarding a page.	2016-02-13 02:27:15 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ad2f7212e2	LangModels: retraining Greek models with my training script. This fixes our Greek/Windows-1253 test.	2015-12-13 18:02:11 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	055332ac7d	BuildLangModel: allow the alphabet list to be written in string format.	2015-12-12 18:50:29 +01:00
Jehan	6b2722885a	BuildLangModel: forgot to add charset/language files.	2015-12-12 18:18:08 +01:00
Jehan	7b4eb9827e	BuildLangModel: add an exception handler on charset spec errors.	2015-12-12 18:00:30 +01:00
Jehan	569509f844	BuildLangModel: forgot to add logs for Thai models generation.	2015-12-04 03:26:52 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is basically identical to TIS-620, with the added non-breaking space character. Our detection will therefore always return TIS-620, except in the exceptional cases where a text has a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	ffcd85f709	script: forgot to commit ISO-8859-9 and Turkish files.	2015-12-04 02:40:54 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	22b9ed2d4f	BuildLangModel: add the concept of custom_case_mapping for langs for which Python's lower() algorithm fails. In particular, the Turkish dotted/dotless 'i' does not follow the same rules as in common Western languages. Lowercase for 'I' is indeed not 'i' but 'ı'. Uppercase for 'i' is indeed not 'I' but 'İ'.	2015-12-04 02:29:40 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	a167bd5e42	BuildLangModel: lowercase only when resulting char has a composed form. I had the case with the Turkish dotted 'İ' where lowercasing it with Python's algorithm returned a decomposed character that it was not able to recompose. Therefore ord() raised a TypeError because the string length was 2.	2015-12-04 01:30:21 +01:00
Jehan	aa587a64bd	LangModels: adding German models for ISO-8859-1 and Windows-1252.	2015-12-03 23:58:41 +01:00
Jehan	0270b1e856	Adding French Windows-1252 support.	2015-12-03 21:22:30 +01:00
Jehan	9cb5764b73	LangModels: update the French language models. Fully built with the script.	2015-11-30 19:20:55 +01:00
Jehan	dc5caa46bc	BuildLangModel: fix hardcoded file names.	2015-11-30 19:18:25 +01:00
Jehan	3e5d37a6b5	BuildLangModel: process pages level per level. I.e. horizontally or "breadth first" rather than vertical tree traversal. This makes sure that all the start pages in particular are searched when using the max_page option.	2015-11-30 19:12:04 +01:00
Jehan	d9d347099e	BuildLangModel: fix some minor comment from a previous spec.	2015-11-30 00:09:23 +01:00
Jehan	192f8de165	BuildLangModel: build models with computed frequent characters count.	2015-11-30 00:04:44 +01:00
Jehan	429448199f	French language model: fix a start page. Because of a bug in the Wikipedia querying Python library.	2015-11-29 23:55:03 +01:00
Jehan	b64831ff89	BuildLangModel: allow a list of start pages and add a page with a word with œ in French to make sure we have such words in our stats.	2015-11-29 15:51:23 +01:00
Jehan	dce79a6631	BuildLangModel: the SequenceModel naming must include the language name.	2015-11-29 15:49:56 +01:00
Jehan	c59465adfc	BuildLangModel: save lang model directly in the right directory.	2015-11-29 13:26:10 +01:00
Jehan	290fbd2e2e	BuildLangModel: add the licensing header to generated files.	2015-11-29 02:26:33 +01:00
Jehan	7f290975ba	BuildLangModel: map different cases of the same character together. With the new case_mapping lang property, we can consider upper and lower case versions of the same character as one character. This makes sense in some languages, and allows some rarer characters (but still in the main alphabet) to enter the frequent character list. For instance 'œ' and 'Œ' in French.	2015-11-29 02:14:48 +01:00
Jehan	00a78faa1d	BuildLangModel: the max_depth should be a script option rather than a language property.	2015-11-29 01:59:28 +01:00
Jehan	274386f424	BuildLangModel: add a --max-page option to limit data size. This is mostly useful for debugging, when we don't want to wait forever to test the script.	2015-11-29 01:42:36 +01:00
Jehan	0314f98ece	BuildLangModel.py: some in-progress script to build language models.	2015-11-29 01:30:04 +01:00
BYVoid	56a4c0d86c	Add authors.	2011-07-13 20:16:23 +08:00
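
Commits 22b9ed2d4f and a167bd5e42 describe a lowercasing problem with the Turkish 'İ' and a per-language override for it. The sketch below only illustrates that behaviour in plain Python. The names TURKISH_CASE_MAPPING and to_lower are hypothetical, and the way BuildLangModel.py actually stores and applies custom_case_mapping may differ.

```python
import unicodedata

# Hypothetical override table, named after the custom_case_mapping idea
# in commit 22b9ed2d4f; the real script's data layout may differ.
TURKISH_CASE_MAPPING = {"I": "ı", "İ": "i"}


def to_lower(char, custom_case_mapping=None):
    """Return a single-code-point lowercase form of char, or char itself."""
    if custom_case_mapping and char in custom_case_mapping:
        return custom_case_mapping[char]
    lowered = unicodedata.normalize("NFC", char.lower())
    # Keep the original when lowercasing yields several code points,
    # since ord() only accepts a string of length 1.
    return lowered if len(lowered) == 1 else char


# Python's default lowercasing decomposes the Turkish dotted capital I
# into 'i' followed by a combining dot above (two code points).
print(len("İ".lower()))
# Without an override the character is left untouched.
print(to_lower("İ") == "İ")
# With the override, both capital forms map to a single character.
print(to_lower("İ", TURKISH_CASE_MAPPING), to_lower("I", TURKISH_CASE_MAPPING))
```

Keeping the original character when no composed lowercase form exists means ord() can always be applied to the result, which is the failure the second commit reports.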

40 Commits
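
Commit 3e5d37a6b5 switches page processing to a level-by-level ("breadth first") order so that every start page is reached even when the number of pages is capped. A minimal sketch of that idea follows. The function crawl_breadth_first, the get_links callback and the toy LINKS graph are assumptions made for illustration and are not taken from BuildLangModel.py; only the max_depth and max_page names echo the options mentioned in the log.

```python
from collections import deque


def crawl_breadth_first(start_pages, get_links, max_depth=2, max_page=None):
    """Visit pages level by level and return titles in visiting order."""
    visited = []
    seen = set(start_pages)
    # Every start page sits at depth 0, so all of them are dequeued
    # before any linked page is.
    queue = deque((title, 0) for title in start_pages)
    while queue:
        if max_page is not None and len(visited) >= max_page:
            break
        title, depth = queue.popleft()
        visited.append(title)
        if depth >= max_depth:
            continue  # do not follow links below the depth limit
        for link in get_links(title):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for Wikipedia (purely illustrative).
LINKS = {"A": ["A1", "A2"], "B": ["B1"], "A1": ["A1x"]}
print(crawl_breadth_first(["A", "B"], lambda t: LINKS.get(t, []), max_page=3))
```

With a depth-first traversal and the same cap of three pages, the crawl could spend its whole budget under the first start page and never visit the second one.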