uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-02-16 23:30:00 +08:00

Author	SHA1	Message	Date
Jehan	886e03a523	Release: version 0.0.5. v0.0.5	2015-12-04 22:45:26 +01:00
Jehan	fe7bf3e994	test: update UTF-16 and UTF-32 tests after label changing.	2015-12-04 19:46:51 +01:00
Jehan	e5234d6b61	Stating endianness of UTF-16 and UTF-32 was an error when BOM present. According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text MUST NOT prepend a BOM to the text." Since uchardet cannot (and should not, obviously, it's not its role) modify input text, when a BOM is present, we should always label the encoding as "UTF-16" only. Also it broke unit tests in using programs since a conversion from UTF-8 to UTF-16LE/BE would create a text without BOM, and a conversion from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed existing behaviours. Same goes for UTF-32. See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in particular).	2015-12-04 19:19:39 +01:00
Jehan	2856e68aac	README: reorganize support list by alphabetic order. (Except for "International" and "Others")	2015-12-04 03:33:22 +01:00
Jehan	5691dc59a1	LangModels: rename Cyrillic models to Russian models. Our language models are per-lang, not per script.	2015-12-04 03:27:29 +01:00
Jehan	569509f844	BuildLangModel: forgot to add logs for Thai models generation.	2015-12-04 03:26:52 +01:00
Jehan	dc03ea002f	README: supports are per-language rather than per script system. In particular separate "Cyrillic" into "Russian" and "Bulgarian" (currently our only 2 supported languages using Cyrillic script).	2015-12-04 03:22:05 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is identical to TIS-620 except for the added non-breaking space character. Our detection will therefore always return TIS-620, except in the rare cases where a text contains a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	ffcd85f709	script: forgot to commit ISO-8859-9 and Turkish files.	2015-12-04 02:40:54 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	22b9ed2d4f	BuildLangModel: add concept of custom_case_mapping for langs for which Python lower() algorithm fails. In particular Turkish dotted/dotless 'i' does not follow same rules as common western languages. Lowercase for 'I' is indeed not 'i' but 'ı'. Uppercase for 'i' is indeed not 'I' but 'İ'.	2015-12-04 02:29:40 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	a167bd5e42	BuildLangModel: lowercase only when resulting char has a composed form. I had the case with the Turkish dotted 'İ' that lowercasing it with Python algorithm returned me a decomposed character that it was not able to recompose. Therefore ord() raised a TypeError because the string length was 2.	2015-12-04 01:30:21 +01:00
Jehan	b56a3c7b84	README: add German support.	2015-12-04 00:07:03 +01:00
Jehan	55b4f23971	Single Byte charsets: high ctrl character ratio lowers confidence. Control characters are not an error per-se. Nevertheless they are clearly not frequent in single-byte charset texts. It is only normal for them to lower confidence in a charset. In particular a higher ctrl-per-letter ratio means a lower confidence. This fixes for instance our Windows-1252 German test (otherwise detected as ISO-8859-1).	2015-12-04 00:04:43 +01:00
Jehan	aa587a64bd	LangModels: adding German models for ISO-8859-1 and Windows-1252.	2015-12-03 23:58:41 +01:00
Jehan	90728e4068	README: update with Windows-1252 support information.	2015-12-03 21:25:53 +01:00
Jehan	0270b1e856	Adding French Windows-1252 support.	2015-12-03 21:22:30 +01:00
Jehan	5d3fb3dc2f	test: add a Windows-1252 French test. Text from https://fr.wikipedia.org/wiki/Œuf_(cuisine)	2015-12-03 21:20:15 +01:00
Jehan	15afc5c593	test: add a Hungarian Windows-1250 test but skip it for now. Text from: https://hu.wikipedia.org/wiki/Magyar_nyelv	2015-12-03 21:18:55 +01:00
Jehan	ea34e8b1bd	Update doc comment. We do not return empty string on ASCII anymore. It means only detection failure, now. ASCII will get a proper "ASCII" return.	2015-12-03 20:36:09 +01:00
Jehan	60f641bf37	Update README to mark independence with original Mozilla code.	2015-12-03 20:32:57 +01:00
Jehan	e4260f4a39	Release: version 0.0.4. v0.0.4	2015-12-03 19:48:58 +01:00
Jehan	ba56d91808	Update uchardet URL in various places.	2015-12-03 19:48:29 +01:00
Jehan	d1bc09e4d7	Update authors. I think I deserved being listed in the authors by now. ;-)	2015-12-03 19:44:13 +01:00
Jehan	c4fa728e7a	Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master Let's shortcut Single Byte charset detection on invalid codepoints. Merging and fixing the contributor's commit conflicts after code redesign: in particular we added an illegal character concept (they were mixed with control characters in current charmaps. Yet ctrl characters are NOT to be considered invalid) and constants instead of hardcoded numbers ('ILL' rather than 255).	2015-12-03 19:26:19 +01:00
Jehan	d686fcc1cd	LangModels: add illegal codepoints information on single byte charmaps.	2015-12-03 19:04:07 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	f4f9fc3f28	test: reenable Windows-1251 test for Russian. Commit 4f1c3ff actually fixed it!	2015-12-02 21:53:27 +01:00
Jehan	9dd6b34e93	test: add French UTF-8 test. Text from: https://fr.wikipedia.org/wiki/UTF-8	2015-11-30 20:03:33 +01:00
Jehan	4f1c3ff85e	nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars. If all sequences in a text are positive sequences, the ratio of positive sequences cannot make the difference between 2 very close charsets. A ratio of positive sequences per letter, on the other hand, will break a tie between 2 encodings. If, while adding a letter, the number of positive sequences does not increase, the confidence will decrease (corresponding to the fact it was likely not a letter). On the other hand, if the number of positive sequences increases, so will the confidence. For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15. When letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.	2015-11-30 19:52:07 +01:00
Jehan	9cb5764b73	LangModels: update the French language models. Fully built with the script.	2015-11-30 19:20:55 +01:00
Jehan	dc5caa46bc	BuildLangModel: fix hardcoded file names.	2015-11-30 19:18:25 +01:00
Jehan	3e5d37a6b5	BuildLangModel: process pages level per level. I.e. horizontally or "breadth first" rather than vertical tree traversal. This makes sure that all the start pages in particular are searched when using the max_page option.	2015-11-30 19:12:04 +01:00
Jehan	04f9309932	tests: update ISO-8859-15 French test file. The previous technical text about charsets themselves was not relevant to identify a language. In particular the special characters that differ between ISO-8859-1 and ISO-8859-15 were used by themselves, out of a char sequence context. Therefore, without language understanding, they could just as well have represented the ISO-8859-15 letters or the ISO-8859-1 symbols at the corresponding codepoints. Replacing with text from this Wikipedia page: https://fr.wikipedia.org/wiki/Œuf_(cuisine) This uses some of these same characters (in particular 'œ') but in contextual character sequences, making it relevant for our algorithm.	2015-11-30 00:19:15 +01:00
Jehan	d9d347099e	BuildLangModel: fix some minor comment from a previous spec.	2015-11-30 00:09:23 +01:00
Jehan	192f8de165	BuildLangModel: build models with computed frequent characters count.	2015-11-30 00:04:44 +01:00
Jehan	429448199f	French language model: fix a start page. Because of a bug in the Wikipedia querying Python library.	2015-11-29 23:55:03 +01:00
Jehan	dbb4c1d2ff	nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE... ... with per-language model "frequent character" count.	2015-11-29 23:51:55 +01:00
Jehan	b64831ff89	BuildLangModel: allow a list of start pages... ... and add a page with a word with œ in French to make sure we have such words in our stats.	2015-11-29 15:51:23 +01:00
Jehan	dce79a6631	BuildLangModel: the SequenceModel naming must include the language name.	2015-11-29 15:49:56 +01:00
Jehan	c59465adfc	BuildLangModel: save lang model directly in the right directory.	2015-11-29 13:26:10 +01:00
Jehan	72fbd33dec	Add a .gitignore.	2015-11-29 02:27:42 +01:00
Jehan	290fbd2e2e	BuildLangModel: add the licensing header to generated files.	2015-11-29 02:26:33 +01:00
Jehan	7f290975ba	BuildLangModel: map different cases of the same character together. With the new case_mapping lang property, we can consider upper and lower case versions of the same character as one character. This makes sense in some languages, and would allow some rarer characters (but still in the main alphabet) to enter the frequent character list. For instance 'œ' and 'Œ' in French.	2015-11-29 02:14:48 +01:00
Jehan	00a78faa1d	BuildLangModel: the max_depth should be a script option... ... rather than a language property.	2015-11-29 01:59:28 +01:00
Jehan	274386f424	BuildLangModel: add a --max-page option to limit data size. This is mostly useful for debugging while we don't want to wait forever to test the script.	2015-11-29 01:42:36 +01:00
Jehan	0314f98ece	BuildLangModel.py: some in-progress script to build language models.	2015-11-29 01:30:04 +01:00
Jehan	a8e9de307b	Add UTF-16 test files without BOM... ... and disable the tests for now for these since uchardet is not able to detect UTF-16 without a BOM for now.	2015-11-28 19:50:18 +01:00
Jehan	92efc0b0b0	Update README: Unicode is "International".	2015-11-28 19:44:13 +01:00
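
The case-mapping problem described in commits 22b9ed2d4f and a167bd5e42 can be reproduced with a few lines of Python. The sketch below is not the code from BuildLangModel.py: the function name `lower_char` and the table `TURKISH_CASE_MAPPING` are hypothetical names chosen for illustration. It only shows why Python's default `lower()` is unsuitable for the Turkish dotted 'İ', and how a per-language override plus a "lowercase only if the result is a single composed character" rule could avoid the `TypeError` from `ord()`.

```python
import unicodedata

# Hypothetical per-language override table (illustrative name, not taken
# from BuildLangModel.py): Turkish 'I' lowers to dotless 'ı', 'İ' to 'i'.
TURKISH_CASE_MAPPING = {"I": "ı", "İ": "i"}


def lower_char(char, custom_case_mapping=None):
    """Return a single-codepoint lowercase form of char, or char unchanged."""
    # A language-specific mapping takes precedence over Python's algorithm.
    if custom_case_mapping and char in custom_case_mapping:
        return custom_case_mapping[char]
    # Python lowers 'İ' to 'i' + U+0307 (combining dot above); NFC
    # normalization cannot recompose that pair into a single codepoint.
    lowered = unicodedata.normalize("NFC", char.lower())
    # Only accept the lowercase form when it is one composed character,
    # so that ord() can safely be called on the result.
    return lowered if len(lowered) == 1 else char


if __name__ == "__main__":
    print(len("İ".lower()))                      # length of Python's default result
    print(lower_char("İ"))                       # kept as-is without a mapping
    print(lower_char("İ", TURKISH_CASE_MAPPING))
    print(lower_char("I", TURKISH_CASE_MAPPING))
```

Keeping the uppercase character when no composed lowercase form exists is one possible reading of the rule in a167bd5e42; the custom mapping from 22b9ed2d4f then handles the languages where that fallback is not good enough.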


127 Commits
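
The labelling rule from commit e5234d6b61 (a text that starts with a BOM is reported as plain "UTF-16" or "UTF-32", never with an explicit endianness) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not uchardet's C++ implementation: the function name `label_from_bom` is hypothetical, and returning `None` when no BOM is found simply stands in for "continue with the regular detection".

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32LE BOM (FF FE 00 00) begins
# with the same two bytes as the UTF-16LE BOM (FF FE).
_BOM_LABELS = (
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
)


def label_from_bom(data):
    """Return an endianness-free label if data starts with a BOM, else None.

    Hypothetical helper: the byte order is implied by the BOM itself, so
    the label never states it (see RFC 2781, section 3.3).
    """
    for bom, label in _BOM_LABELS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    # Python's "utf-16" and "utf-32" codecs prepend a BOM when encoding,
    # while the "-le"/"-be" variants do not.
    print(label_from_bom("hi".encode("utf-16")))
    print(label_from_bom("hi".encode("utf-32")))
    print(label_from_bom("hi".encode("utf-16-le")))
```

Note that the byte sequence FF FE 00 00 is ambiguous in principle (it could also be UTF-16LE text starting with a NUL character); this sketch resolves it in favour of UTF-32, which is an assumption of the example.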