Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, whose contents were placed in the Public Domain, is a
good reference for the encodings historically used for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
I am not sure whether it is useful to have the 'if (mDetectedCharset)'
check outside the if block, but it certainly does not hurt in this
specific case, so I leave the current code logic as is.
The exact warning was:
nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’:
nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
if (aLen > 2)
^~
nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
if (mDetectedCharset)
^~
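For illustration, here is a minimal sketch of the re-indentation that
silences the warning while keeping the logic unchanged (the real code is
in nsUniversalDetector::HandleData(); the statement bodies are placeholders):

    if (aLen > 2)
    {
        /* ... BOM checks ... */
    }

    /* Re-indented to the outer level so it no longer looks as if it were
       guarded by the 'if (aLen > 2)' above; it never was guarded, and the
       behaviour is unchanged. */
    if (mDetectedCharset)
    {
        /* ... */
    }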
Just realized that these 2 languages can also be encoded with these
charsets (even though ISO-8859-13 would appear to be more common…
maybe?). Anyway, the models are now updated and can recognize texts
using these encodings for these languages.
Added some test files as well, which work great.
I actually also added combinations with ISO-8859-9, ISO-8859-15 and
Windows-1252. Nevertheless, these charsets do not differ on the main
characters relevant to Portuguese, so they can hardly be told apart and
detection will usually return ISO-8859-1 only.
"UHC" is the "Unified Hangul Code" (aka Windows-949 or CP949). It is
apparently "mostly" upward compatible with EUC-KR, so returning UHC for
a strict EUC-KR document should usually not be considered wrong.
Yet I read that EUC-KR has its own way of representing hangul syllables
that are not available in precomposed form, and this is not supported
in UHC (since the latter has all possible precomposed syllables), hence
the "mostly" upward compatibility.
My personal daily experience with Korean documents, though, is that I
encounter a lot of UHC-encoded files, probably because of the
predominance of Microsoft operating systems, which spread this encoding.
So until we get 2 separate detection machines, let's just return EUC-KR
files as being "UHC".
I did this to improve the model after a user reported a Greek subtitle
being badly detected (see commit e0eec3b).
It didn't help, but since I updated the model with much more data from
Wikipedia anyway, let's just commit it!
Up to now, we were only considering positive sequences, i.e. the
2-character sequences which occur the most. Yet our data gathers
4 categories of sequences (the last one being called "negative", since
those sequences never occurred in our data).
I will call the category just below positive "probable" sequences: they
may occur, yet not often. The remaining category could be called "neutral".
This seems to fix the detection of a user's subtitle example without
breaking any of our current unit tests.
I should probably still review this whole logic in more detail later.
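A hedged illustration of the four categories (the identifiers and numeric
values here are assumptions for this sketch, not necessarily the model's
actual ones):

    /* Each 2-character sequence is classified by the language model,
       roughly by how often it occurred in the training data. */
    enum SequenceCategory {
        NEGATIVE_CAT = 0,  /* never occurred in our data        */
        NEUTRAL_CAT  = 1,  /* occurred, but very rarely         */
        PROBABLE_CAT = 2,  /* may occur, yet not often          */
        POSITIVE_CAT = 3   /* the sequences that occur the most */
    };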
This way it always shows up in ccmake, even if not defined.
A STRING is used instead of a PATH because I personally think it makes
more sense in the following use cases:
STRING:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
installs everything to /home/user/{lib,etc,share,(...)} and executables to
${CMAKE_INSTALL_PREFIX}/bins
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
everything to /home/user/{lib,etc,share,(...)} and executables to
/opt/bin
PATH:
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
everything to /home/user/{lib,etc,share,(...)} and executables to
$(pwd)/bins (!)
-DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
same as STRING
I was planning on adding VISCII support as well, but Python's encode()
method apparently has no support for it, so I cannot generate the proper
statistics data with the current version of the script.
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according
to code comments), probably tied to specificities of the original Mozilla
program. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as the only
exception" (note that I could have returned any of the ISO-8859 charsets
since they all have this character in common).
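A minimal sketch of the intended behaviour (not the actual prober code;
the helper name is made up):

    #include <stddef.h>

    /* Report ISO-8859-1 when the only non-ASCII byte is 0xA0 (NBSP),
       instead of wrongly reporting ASCII. */
    static const char *
    label_nearly_ascii(const unsigned char *buf, size_t len)
    {
        int has_nbsp = 0;
        for (size_t i = 0; i < len; i++) {
            if (buf[i] == 0xA0)
                has_nbsp = 1;
            else if (buf[i] > 0x7F)
                return NULL;  /* real non-ASCII data: handled elsewhere */
        }
        return has_nbsp ? "ISO-8859-1" : "ASCII";
    }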
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
Also, it broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE creates a text without a BOM, while a conversion
from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed
existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
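As a rough sketch of the labelling rule (illustrative only; the detector's
actual code paths differ, and UTF-32 would be handled analogously):

    #include <stddef.h>

    /* With a BOM, the byte order is carried by the BOM itself, so the
       generic label is the right one (RFC 2781, section 3.3). */
    static const char *
    utf16_label(const unsigned char *buf, size_t len)
    {
        if (len >= 2 &&
            ((buf[0] == 0xFF && buf[1] == 0xFE) ||   /* little-endian BOM */
             (buf[0] == 0xFE && buf[1] == 0xFF)))    /* big-endian BOM    */
            return "UTF-16";
        /* No BOM: "UTF-16LE" or "UTF-16BE" must be decided elsewhere. */
        return NULL;
    }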
ISO-8859-11 is basically identical to TIS-620, with the added
non-breaking space character.
So our detection will always return TIS-620, except in the exceptional
case where a text contains a non-breaking space.
Control characters are not an error per se. Nevertheless, they are clearly
not frequent in single-byte charset texts, so it is only natural for them
to lower the confidence in a charset. In particular, a higher
control-characters-per-letter ratio means a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
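A hedged sketch of the idea (the function name and exact formula are
assumptions, not the prober's actual code):

    /* A higher control-characters-per-letter ratio dampens the confidence. */
    static float
    dampen_confidence(float base_confidence, unsigned ctrl_chars, unsigned letters)
    {
        if (letters == 0)
            return 0.0f;
        float ctrl_per_letter = (float)ctrl_chars / (float)letters;
        return base_confidence / (1.0f + ctrl_per_letter);
    }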
Let's shortcut Single Byte charset detection on invalid codepoints.
Merge the contributor's commit and fix conflicts after the code redesign:
in particular, we added an illegal-character concept (illegal characters
were mixed with control characters in the current charmaps, yet control
characters are NOT to be considered invalid) and constants instead of
hardcoded numbers ('ILL' rather than 255).
If all sequences in a text are positive sequences, the ratio of positive
sequences cannot tell the difference between 2 very close charsets.
A ratio of positive sequences per letter, on the other hand, will break
a tie between 2 encodings. If, while adding a letter, the number of
positive sequences does not increase, the confidence will decrease
(reflecting the fact that it was likely not a letter).
On the other hand, if the number of positive sequences increases, so
will the confidence.
For instance, this fixes wrong detections between ISO-8859-1 and
ISO-8859-15: when letters only available in ISO-8859-15 appear in a text,
we expect the confidence to tilt towards the close yet slightly different
ISO-8859-15.
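A sketch of the reasoning above (names and formula are illustrative
assumptions; the prober's real computation is more involved):

    static float
    confidence(unsigned positive_seqs, unsigned total_seqs, unsigned letters)
    {
        if (total_seqs == 0 || letters == 0)
            return 0.0f;
        /* The classic ratio saturates at 1.0 when every sequence is positive... */
        float positive_ratio = (float)positive_seqs / (float)total_seqs;
        /* ...so weight it by positive sequences per letter: a letter that adds
           no positive sequence lowers the result, breaking ties between very
           close charsets such as ISO-8859-1 and ISO-8859-15. */
        float per_letter = (float)positive_seqs / (float)letters;
        return positive_ratio * per_letter;
    }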
The lib used to return "" both for properly detected ASCII and for a
detection failure, and the tool would print "ascii/unknown" in both cases.
Make a proper distinction between the 2 cases.
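From a caller's perspective, the distinction looks roughly like this
(a usage sketch against the public C API; the exact name returned for
plain ASCII depends on the version, presumably "ASCII" after this change):

    #include <stddef.h>
    #include <stdio.h>
    #include <uchardet.h>

    static void report(const char *data, size_t len)
    {
        uchardet_t ud = uchardet_new();
        uchardet_handle_data(ud, data, len);
        uchardet_data_end(ud);

        const char *cs = uchardet_get_charset(ud);
        if (cs == NULL || *cs == '\0')
            puts("unknown");   /* detection failure */
        else
            puts(cs);          /* a real charset name, even for pure ASCII input */

        uchardet_delete(ud);
    }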
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
Control characters, carriage returns, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
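Something along these lines (apart from 'ILL', mentioned earlier, the
identifiers and values are assumptions for illustration):

    /* Character classes used in the single-byte charmaps, so the tables
       read as symbols instead of magic numbers. */
    #define ILL 255   /* illegal codepoint in this charset */
    #define CTR 254   /* control character                 */
    #define SYM 253   /* symbol                            */
    #define NUM 252   /* number / digit                    */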
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
It is still ON by default, which means both shared and static libs will
be built and installed (current behavior), but it makes it possible to
disable the build of a static lib.
Closes https://github.com/BYVoid/uchardet/issues/1.
It was not clear if our naming followed any kind of rules. In particular,
iconv is a widely used encoding conversion API. We will follow its
naming.
At least 1 returned name was found invalid: x-euc-tw instead of EUC-TW.
Other names have been uppercased to follow naming from `iconv --list`
though iconv is mostly case-insensitive so it should not have been a
problem. "Just in case".
Prober names can still be chosen freely (they are apparently only used
for display output).
Finally, HZ-GB-2312 is absent from my iconv list, but I can still see
this encoding under this name in the libiconv master code, so I will
consider it valid.
... to stay backward compatible with previous behavior.
About detection failure, our in-code documentation says:
"@return name of charset on success and "" on failure or pure ascii."
This behavior had been broken by commit 3a518c0, which returned NULL
instead. Our command-line tool was the first victim, segfaulting on
ASCII files.
This fixes the following warning when including uchardet.h in C source,
built with -Wstrict-prototypes:
`uchardet.h:52:1: warning: function declaration isn't a prototype`
Identifiers starting with __ are reserved for the system; user code
(including non-system libraries) must not define them.
A function which takes no parameters is declared with "(void)". In C, an
empty parameter list means that any number of parameters with
unspecified types is allowed, which is not what we want in this case.
Another reason to fix this is that compilers often warn if this legacy
feature is used, which is bothersome for API users.
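For instance, for a parameterless constructor like uchardet_new() the
change boils down to this (a sketch, not a verbatim diff of uchardet.h):

    /* Before: a declaration, but not a prototype in C (any arguments allowed). */
    uchardet_t uchardet_new();

    /* After: explicitly takes no parameters. */
    uchardet_t uchardet_new(void);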
Additionally, use an opaque struct as the underlying type for uchardet_t.
This facilitates type-checking, as it is harder to confuse with other
types, especially in C. This is not strictly a conformance issue, but
still a nice change. Note that this is neither an API nor an ABI change.
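The underlying typedef then looks something like this (the struct tag is
an assumption; the point is only that it is an opaque struct pointer
rather than, say, a void pointer):

    /* Opaque handle: the struct is only defined inside the library, so
       callers get type-checking without any API or ABI change. */
    typedef struct uchardet * uchardet_t;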