uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2026-01-01 03:12:24 +08:00

Author	SHA1	Message	Date
Jehan	0314f98ece	BuildLangModel.py: some in-progress script to build language models.	2015-11-29 01:30:04 +01:00
Jehan	a8e9de307b	Add UTF-16 test files without BOM... ... and disable the tests for now for these since uchardet is not able to detect UTF-16 without a BOM as for now.	2015-11-28 19:50:18 +01:00
Jehan	92efc0b0b0	Update README: Unicode is "International".	2015-11-28 19:44:13 +01:00
Jehan	573b303fe3	Add an ASCII test file for English... ... with escape characters because even with ESC, a file is ASCII unless proven otherwise.	2015-11-28 17:49:13 +01:00
Jehan	0289c2a232	Differentiate ASCII and detection failure. The lib used to return "" for both properly detected ASCII and detection failure. And the tool would return "ascii/unknown". Make a proper distinction between the 2 cases.	2015-11-28 17:04:52 +01:00
Jehan	4dbc6e7ab3	Update README with French support.	2015-11-28 02:20:57 +01:00
Jehan	50588ba375	Add a ISO-8859-15 test file for French.	2015-11-28 02:18:57 +01:00
Jehan	005fd98086	Add initial support for French with ISO-8859-1 and ISO-8859-15. Mostly generated with a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generating script to the main tree.	2015-11-28 02:14:39 +01:00
Jehan	2106173546	Move all Single-Byte language models to a subdirectory.	2015-11-27 23:11:23 +01:00
Jehan	b67370230b	Update README and manual... ... to indicate several files can be specified on command line.	2015-11-27 18:27:11 +01:00
Jehan	984d8f7b09	Add language information in model names when they were missing. Models are language specific (there could be several models for the same charset but different languages). Let's have a clear naming scheme.	2015-11-27 18:21:13 +01:00
Jehan	c61e65aeb3	s/MACCYRILLIC/MAC-CYRILLIC/ Write encoding names in README same as what uchardet returns.	2015-11-27 18:19:02 +01:00
Jehan	942ac05ff5	Add some Russian test files. Texts from: IBM855: https://ru.wikipedia.org/wiki/CP855 IBM866: https://ru.wikipedia.org/wiki/Альтернативная_кодировка MAC-CYRILLIC: https://ru.wikipedia.org/wiki/MacCyrillic	2015-11-27 18:17:20 +01:00
Jehan	42b91898da	Create 3-letter constants for special charmap characters. Control characters, carriage, symbols and numbers. Also add a constant for illegal characters (not used for now). This will allow easier processing and charmap reading.	2015-11-27 17:41:54 +01:00
Jehan	7fa0fefef8	Add UTF-16 and UTF-32 test files in French, with BOM. Unfortunately uchardet currently seems unable to detect UTF-16/32 text without a BOM.	2015-11-26 02:45:00 +01:00
Jehan	e8dd55995a	Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info... ... and add UTF-32 BOM detection.	2015-11-24 18:50:23 +01:00
Jehan	9a74d08b3c	Fix minor space issues.	2015-11-24 00:15:44 +01:00
Jehan	d082704fec	Add Mageia command and specify Mint compatibility.	2015-11-23 17:46:01 +01:00
Jehan	ff5fd5eff9	Release: version 0.0.3. v0.0.3	2015-11-19 15:18:11 +01:00
Jehan	5dcff7b241	Hide away tests known to fail. Some charsets are simply not supported (ex: fr:iso-8859-1), some are temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly detected as closely related charsets. These were broken (or not efficient) from the start, and there is no need to pollute the `make test` output with these, which may make us miss when actual regressions will occur. So let's hide these away for now until we can improve the situation.	2015-11-18 20:02:58 +01:00
Jehan	4b38e68aa2	CMake tests: separate the lang and charset with colon... ... rather than an hyphen. It makes it easier to read.	2015-11-18 19:42:35 +01:00
Jehan	35153b1e50	Fixes boolean operation precedence warnings... ... and some minor space issues. Some explicit parentheses were needed to make precedence obvious. Warning was: "warning: suggest parentheses around ‘&&’ within ‘\|\|’ [-Wparentheses]"	2015-11-18 19:38:12 +01:00
Jehan	0d70a36910	Adding some more test files for Russian and Chinese. Taken from: https://zh.wikipedia.org/wiki/EUC https://ru.wikipedia.org/wiki/КОИ-8 And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.	2015-11-18 19:27:38 +01:00
Jehan	eb727d3aca	Add automatic testing against every test file.	2015-11-18 18:18:27 +01:00
Jehan	f303a41735	Add Thai test file for UTF-8. Text from Thai Wikipedia: https://th.wikipedia.org/wiki/ยูนิโคด	2015-11-18 03:26:34 +01:00
Jehan	9d9257072a	s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming.	2015-11-18 03:21:34 +01:00
Jehan	e7c8114233	Add Hebrew test files. Texts from Hebrew Wikipedia: https://he.wikipedia.org/wiki/עברית https://he.wikipedia.org/wiki/ISO_8859 https://he.wikipedia.org/wiki/UTF-8 uchardet fails to detect the ISO-8859-8 files and detects it as Windows-1255, which is probably acceptable since it is apparently an "almost compatible superset". It may be worth trying to make more complete test files in the future to demonstrate the differences.	2015-11-18 03:16:18 +01:00
Jehan	601e59bd83	Add Greek test files. Taken from Greek Wikipedia: https://el.wikipedia.org/wiki/UTF-8 https://el.wikipedia.org/wiki/ISO_8859-7 https://el.wikipedia.org/wiki/ISO_8859-7#Windows-1253 Windows-1253 test fails and returns "ISO-8859-7". They are actually fairly close for main letters, except for Ά, which make them difficult to differentiate.	2015-11-18 02:57:09 +01:00
Jehan	c8532f63a8	Adding UTF-8 file for Korean. Text taken from Korean Wikipedia: https://ko.wikipedia.org/wiki/UTF-8	2015-11-18 02:36:33 +01:00
Jehan	1a58fa6d99	Update AUTHORS.	2015-11-17 21:51:59 +01:00
Jehan	4db0d55692	URL of related project python-chardet has changed.	2015-11-17 21:40:44 +01:00
Jehan	a76c0786b3	Adding test files for main Japanese encoding... ... taken from the following Japanese Wikipedia pages: https://ja.wikipedia.org/wiki/Extended_Unix_Code https://ja.wikipedia.org/wiki/ISO/IEC_2022 https://ja.wikipedia.org/wiki/UTF-8	2015-11-17 21:24:47 +01:00
Jehan	0efcdfa546	Reorganize test files in language subdirectories. I realize that the language information a text has been written in is very important since it would completely change the character distribution. Our test files should take this into account, and we should create several test files in different languages for encoding used in various languages.	2015-11-17 21:12:39 +01:00
Jehan	192b0e7d51	Add test files for ISO-8859-[12]. Taken from French page about ISO-8859-1: https://fr.wikipedia.org/wiki/ISO_8859-1 ... and Hungarian Wikipedia page about ISO-8859-2: https://hu.wikipedia.org/wiki/ISO/IEC_8859-2 We don't have support for ISO-8859-1, and both these files are detected as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is disabled because the ISO-8859-1 file would be detected as ISO-8859-2, which would in turn be a clear error.	2015-11-17 19:39:58 +01:00
Jehan	3f3f4b8011	Add a ISO-8859-5 test file. Text taken from Russian Wikipedia page about ISO-8859-5: https://ru.wikipedia.org/wiki/ISO_8859-5	2015-11-17 19:11:59 +01:00
Jehan	bafccfcea8	Add a Windows-1251 test files. Texts taken from Bulgarian Wikipedia page about Windows-1251: https://bg.wikipedia.org/wiki/Windows-1251 ... and Russian Wikipedia page about Windows-1251: https://ru.wikipedia.org/wiki/Windows-1251 The Bulgarian file detection is right, but the Russian detection returns "MAC-CYRILLIC", which is an error and should be fixed.	2015-11-17 19:09:37 +01:00
Jehan	41f3b757f1	Some more encoding names changed to be iconv-compatible. I forgot to fix some names. In particular "x-mac-cyrillic" is not valid in iconv, and has been changed to "MAC-CYRILLIC".	2015-11-17 18:51:45 +01:00
Jehan	8216f7b395	Add an ISO-2022-KR test file. Text taken from Korean Wikipedia page about the ISO-2022-KR: https://ko.wikipedia.org/wiki/ISO/IEC_2022	2015-11-17 18:23:46 +01:00
Jehan	ad4dfc4be4	Add a BUILD_STATIC CMake option to optionally build a static library. It is still ON by default, which means both shared and static libs will be built and installed (current behavior), but it makes it possible to disable the build of a static lib. Closes https://github.com/BYVoid/uchardet/issues/1.	2015-11-17 18:14:51 +01:00
Jehan	9172b763d1	Add TIS-620 in README (Thai language) and a test file. Test text based on Thai Wikipedia page about the TIS-620 encoding: https://th.wikipedia.org/wiki/TIS-620	2015-11-17 17:39:45 +01:00
Jehan	399c4c4d9e	Add libchardet in related projects. See https://github.com/BYVoid/uchardet/issues/11 for review of differences with uchardet.	2015-11-17 17:12:44 +01:00
Jehan	362e36d1ed	Add EUC-KR test file. Contains text taken from Wikipedia on EUC-KR page in Korean. https://ko.wikipedia.org/wiki/EUC-KR I added it as a simili-subtitle file because as the original Mozilla paper says: "The input text may contain extraneous noises which have no relation to its encoding, e.g. HTML tags, non-native words". Therefore I feel it is important to have test files a little noisy if possible, in order to test our resistance to noise in our algorithm.	2015-11-17 16:36:17 +01:00
Jehan	dc371f3ba9	uchardet_get_charset() must return iconv-compatible names. It was not clear if our naming followed any kind of rules. In particular, iconv is a widely used encoding conversion API. We will follow its naming. At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW. Other names have been uppercased to follow naming from `iconv --list` though iconv is mostly case-insensitive so it should not have been a problem. "Just in case". Prober names can still have free naming (only used for output display apparently). Finally HZ-GB-2312 is absent from my iconv list, but I can still see this encoding in libiconv master code with this name. So I will consider it valid.	2015-11-17 16:15:21 +01:00
Jehan	256d1957b2	uchardet_get_charset() should never return NULL... ... to stay backward compatible with previous behavior. About detection failure, our in-code documentation says: "@return name of charset on success and "" on failure or pure ascii." This behavior had been broken by commit 3a518c0, which returned NULL instead. Our command-line tool was the first victim, segfaulting on ASCII files.	2015-11-16 17:33:16 +01:00
Jehan	d0ccdd5db9	Release: version 0.0.2. v0.0.2	2015-11-16 15:56:45 +01:00
Carbo Kuo	016eb18437	Merge pull request #15 from wang-bin/c++abi do not use std::string which breaks c++ abi	2015-11-09 20:04:21 +01:00
Carbo Kuo	124d99bcd7	Merge pull request #9 from Jehan/master (void) and () empty arguments are different in C.	2015-11-09 20:01:22 +01:00
Carbo Kuo	6d562268c3	Merge pull request #13 from cicku/patch-1 Refine Description in pkgconfig file	2015-11-09 19:58:47 +01:00
wang-bin	3a518c0536	do not use std::string which breaks c++ abi Some stl types can break abi. If the program is built with g++ 5, and libstdc++ on the target platform is g++ 4.x, then it can not run	2015-11-04 18:16:24 +08:00
Christopher Meng	a55c6d26af	Refine Description in pkgconfig file	2015-09-21 09:37:36 +08:00

79 Commits