uchardet

mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git synced 2025-12-23 20:24:45 +08:00

Author	SHA1	Message	Date
Jehan	15fc8f0a0f	src: now reporting encoding+confidence and keeping a list. Preparing for an updated API which will also allow to look at the confidence value, as well as get the list of possible candidates (i.e. all detected encodings which had a confidence value high enough so that we would even consider them). It is still only internal logic though.	2022-12-14 00:23:13 +01:00
Jehan	388777be51	script, src, test: add IBM865 support for Danish. Newly added IBM865 charset (for Norwegian) can also be used for Danish. By the way, I fixed `script/charsets/ibm865.py`, as Danish uses the 'da' ISO 639-1 code, not 'dk' (which is sometimes used for other codes for Denmark, such as ISO 3166 country code and internet TLD, but not for the language itself). For the test, adding some text from the top article of the day on the Danish Wikipedia, which was about Jimi Hendrix. And that's cool! 🎸 ;-)	2022-11-30 19:57:52 +01:00
Martin T. H. Sandsmark	099a9a4fd6	Add norwegian support	2022-11-30 19:09:09 +01:00
Lucinda May Phipps	45bd32d102	src/tools/uchardet.cpp: make stuff static	2022-11-29 13:57:31 +00:00
Lucinda May Phipps	383bf118c9	don't use feof	2022-11-29 13:57:31 +00:00
Pedro López-Cabanillas	d7dad549bd	cmake exported targets The minimum required cmake version is raised to 3.1, because the exported targets started at that version. The build system creates the exported targets: - The executable uchardet::uchardet - The library uchardet::libuchardet - The static library uchardet::libuchardet_static A downstream project using CMake can find and link the library target directly with cmake (without needing pkg-config) this way: ~~~ project(sample LANGUAGES C) find_package ( uchardet ) if (uchardet_FOUND) add_executable( sample sample.c ) target_link_libraries ( sample PRIVATE uchardet::libuchardet ) endif () ~~~ After installing uchardet in a prefix like "$HOME/uchardet/": cmake -DCMAKE_PREFIX_PATH="$HOME/uchardet/;..." Instead of installing, the build directory can be used directly, for instance: cmake -Duchardet_DIR="$HOME/uchardet-0.1.0/build/" ...	2021-11-09 09:52:15 +00:00
myd7349	8681fc060e	build: Add uchardet CLI tool building support for MSVC	2020-04-26 08:16:14 +00:00
myd7349	5bcbd23acf	build: Fix build errors on Windows - Fix string no output variables on UWP On UWP, CMAKE_SYSTEM_PROCESSOR may be empty. As a result: string(TOLOWER ${CMAKE_SYSTEM_PROCESSOR} TARGET_ARCHITECTURE) will be treated as: string(TOLOWER TARGET_ARCHITECTURE) which, as a result, will cause a CMake error: CMake Error at CMakeLists.txt:42 (string): string no output variable specified - Remove unnecessary header inclusions in uchardet.cpp These extra inclusions cause build errors on Windows.	2020-04-26 10:08:45 +08:00
Jehan	44a50c30ee	Issue #8 : no newline at end of file. Not sure if it is in the C++ standard, or was, but apparently some compilers may complain when files don't end with a newline (though neither GCC nor Clang as our CI and my local builds are fine). So here are all our generated source which didn't have such ending newline (hopefully I forgot none). I just loaded them in my vim editor, and resaved them. This was enough to add an ending newline.	2020-04-22 22:53:25 +02:00
Jehan	6c7f32a751	Issue #10 : Crashing sequence with nsSJISProber. uchardet_handle_data() should not try to process data of nul length. Still this is not technically an error to feed empty data to the engine, and I could imagine it could happen especially when done in some automatic process with random input files (which looks like what was happening in the reporter case). So feeding empty data just returns a success without actually doing any processing, allowing to continue the data feed.	2020-04-22 22:11:51 +02:00
Jehan	ef0313046b	Also allow uchardet tool to detect encoding of a file named "--". My previous commit was good except for the very special case of wanting to analyze a file named "--". This file would be ignored. With this change, only the first "--" option will be ignored as meaning "end of option arguments", but any remaining value (another "--" included) will be considered as a file path.	2020-04-22 21:11:23 +02:00
Jehan	4a37dfdf1c	Issue #15 : support "--" end-of-option.	2020-04-22 21:05:44 +02:00
wangqr	ae7acbd0f2	Add dllexport to interface functions This allows building the DLL on Windows with other compilers than GNU ones. See MR !4.	2020-04-22 18:54:07 +00:00
Artem Klevtsov	2694ba6363	Fix global-buffer-overflow due to EUCTW_TABLE_SIZE	2020-04-22 17:06:40 +00:00
Jehan	e0b9269849	Fix various other occurrences of bug tracker URL in code/build.	2020-04-22 12:29:41 +02:00
Jehan	1898847eb6	src: cast value to its proper type. Thanks to Marino Faggiana for reporting it. See: https://github.com/BYVoid/uchardet/issues/37	2017-08-27 13:01:30 +02:00
Jehan	170ef349cf	src: fix some doc comments. s/a instance/an instance/. Unless mistaken, we should use "an" with next word starting with vowel.	2017-08-19 10:46:25 +02:00
Jehan	c049332c41	src: s/detctor/detector/.	2017-08-18 12:03:54 +02:00
Jehan	53f7ad0e0b	Bug 101032 - assignments to nsSMState in nsCodingStateMachine result... ... in unspecified behavior. When compiling with UBSan (-fsanitize=undefined), execution complains: > runtime error: load of value 5, which is not a valid value for type 'nsSMState' Since the machine states depend on every different charset's state machine, it is not possible to simply extend the enum with more generic values. Instead let's just make the state as an unsigned int value and define the 3 generic states as constants.	2017-05-28 20:01:06 +02:00
Jehan	50bc02c0ff	Request C++11 standard project-wide and make it a strong requirement. It is unneeded to do it by target, using the global property CMAKE_CXX_STANDARD instead. Also with CMAKE_CXX_STANDARD_REQUIRED, I make this a strong requirement. The documentation indeed states that the CXX_STANDARD "is treated as optional and may “decay” to a previous standard if the requested is not available". This means that uchardet will likely not be buildable with a compiler with no C++11 support. But I assume this is not a common situation, and probably we should not care about outdated compilers. I remain open to suggestions and disagreement on the topic obviously.	2017-05-28 15:43:44 +02:00
Jehan	1bf198cb0f	Make C++11 the standard used for uchardet. As discussed in bug 101032, it seems like the most common usage nowadays. Let's make a specific choice to avoid different behavior on different builds later on.	2017-05-28 15:32:06 +02:00
Jehan	98bf4d73fd	Bug 101204 - different results with different chunk sizes. ASCII and ISO-8859-1 should not be detected in nsUniversalDetector::HandleData() but in nsUniversalDetector::DataEnd() instead. Otherwise it creates an unwanted shortcut from the first call to uchardet_handle_data() if the input is broken into several pieces and if the first chunk happens to be ASCII (or ASCII + NBSP).	2017-05-28 14:14:48 +02:00
Jehan	50743e16f8	src: minor indentation fix.	2017-05-14 21:35:11 +02:00
Jehan	94b10b9b29	Bug 101030 - Buffer overflow related to ISO2022JP detection in... ... en:ascii and ja:iso-2022-jp tests. I don't know much about this part of the code at this point. Yet I can clearly deduce that the length of the charLenTable is supposed to be the classFactor of the SMModel. Therefore 2 classes were missing in ISO2022JPCharLenTable, hence a buffer overflow happens when trying to reach these. I am not sure of the values I should add there. For now, let's set 0 to both, but adding also a comment so that I can review this code later on, when I will get to read and understand this piece of code in more depth.	2017-05-14 19:49:01 +02:00
Jehan	64efb1b24c	Bug 101031 - Memory leak of nsSBCSGroupProber. This manual incrementation code is just horrible and so error-prone. Some day, we should make a cleaner loop to register all these single-byte charset probers.	2017-05-14 18:24:11 +02:00
Jehan	119fed7e8d	LangModels: add Swedish support. Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from https://sv.wikipedia.org/wiki/Mölle	2016-09-28 22:42:13 +02:00
Jehan	d62154bd6e	LangModels: add Slovene support. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and MAC-CENTRALEUROPE. Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet	2016-09-28 22:13:17 +02:00
Jehan	fbd2efdbe9	LangModels: Romanian support added. Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852. Test texts from https://ro.wikipedia.org/wiki/Danemarca	2016-09-28 19:57:50 +02:00
Jehan	a7525b404d	LangModels: added support for Irish Gaelic. Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Test text from: https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta	2016-09-27 00:49:05 +02:00
Jehan	a3a271dfd5	LangModels: Estonian models created. Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and Windows-1257. Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov Windows-1257 and ISO-8859-13 are very close so I added quotation marks (Jutumärgid) which are on codepoints only present in ISO-8859-13, making both encoding apart.	2016-09-27 00:14:29 +02:00
Jehan	3c6d31f5c2	LangModels: new Croatian models. Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250 and MAC-CENTRALEUROPE. Test text from https://hr.wikipedia.org/wiki/Brekinja	2016-09-26 01:32:49 +02:00
Jehan	05ba8555cd	src: fix number of Single-Byte charset probers.	2016-09-25 14:02:39 +02:00
Jehan	f262b1d65b	LangModels: add Italian support. Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15 and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added since other encoding end up similar as ISO-8859-1 for most common texts (i.e. glyphs used in Italian are on the same codepoints on these other encodings). Test text from https://it.wikipedia.org/wiki/Architettura_longobarda	2016-09-21 18:52:09 +02:00
Jehan	6bbe7da1ac	LangModels: add Finnish support. I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13, ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters. Nevertheless most texts in these encoding end up the same (same codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1 and UTF-8. Models for other encoding may still be useful when processing texts with some symbols, etc.	2016-09-21 18:27:39 +02:00
Jehan	a59b1c9571	src: update documentation comments on the public API.	2016-09-21 17:36:17 +02:00
Jehan	3401ac70d0	LangModels: add Polish support. With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16, Windows-1250, IBM852, MAC-CENTRALEUROPE. Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska	2016-09-21 17:30:15 +02:00
Jehan	5f9ec3aef0	LangModels: add support for Slovak. Encodings are the same as Czech (Windows-1250, ISO-8859-2 and Mac-CentralEurope) since the resource I found indicate they used the same encodings historically. Also it is to be noted that the test examples' encoding were already properly detected through Czech's models so the languages are definitely very close, even statistically. Nevertheless adding the right models will work better and these get better scores. This will take all its meaning when uchardet will also be used as a language detector (in some not-too-far future, hopefully!). Test text taken from: https://sk.wikipedia.org/wiki/Jupiter	2016-09-21 13:42:20 +02:00
Jehan	26e1cebad1	LangModels: add support for Czech. Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope. Other encodings are known to have been used for Czech: Kamenicky, KOI-8 CS2 and Cork. But these are uncommon enough that I decided not to support them (especially since I can't find them supported in iconv either, or at least not under an alias which I could recognize). This web page, which contents was made under the Public Domain, is a good reference for encodings which were used historically for Czech and Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html	2016-09-21 03:33:50 +02:00
Jehan	183092d048	src: fix non-guarded 'if' warning. Not sure if this is useful to have the 'if (mDetectedCharset)' outside the if block, but it won't hurt for sure in this specific case, so I leave the current code logics as is. The exact warning was: nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’: nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] if (aLen > 2) ^~ nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’ if (mDetectedCharset) ^~	2016-09-21 02:37:31 +02:00
Jehan	2700cf3a83	LangModels: support for Maltese / ISO-8859-3. Test text from https://mt.wikipedia.org/wiki/Franza.	2016-09-21 02:11:31 +02:00
Jehan	b7aebfdfda	LangModels: add support for Latvian \| Lithuanian / ISO-8859-4 \| ISO-8859-10. Just realizing that these 2 language can also be encoded with these charsets (even though ISO-8859-13 would appear to be more common… maybe?). Anyway now the models are updated and can recognize texts using these encoding for these languages. Added some test files as well, which work great.	2016-09-21 00:27:16 +02:00
Jehan	e138839f07	LangModels: add support for Portuguese / ISO-8859-1. I actually added also couples with ISO-8859-9, ISO-8859-15 and Windows-1252. Nevertheless there are no differences on the main characters related to Portuguese so differences will hardly be made and detection will usually return ISO-8859-1 only.	2016-09-21 00:01:07 +02:00
Jehan	ea2f4dd40f	LangModels: new support for Latvian / ISO-8859-13. Test text extracted from: https://lv.wikipedia.org/wiki/Vinsents_van_Gogs	2016-09-20 23:29:53 +02:00
Jehan	7cb3dd9ddd	LangModels: add support for Lithuanian / ISO-8859-13. Test text extracted from https://lt.wikipedia.org/wiki/Vincent_van_Gogh.	2016-09-20 23:09:24 +02:00
Jehan	157de1dc65	src: the EUC-KR prober now returns "UHC" as encoding name. "UHC" is the "Unified Hangul Code" (aka Windows-949 or CP949). It is apparently "mostly" upward compatible with EUC-KR so returning UHC for a strict EUC-KR document is usually not to be considered wrong. Yet I can read that EUC-KR has its own way of representing hangul syllables not available in precomposed form, and this is not supported in UHC (since this latter has all possible precomposed syllables), hence the "mostly" upward-compatibility. My personal daily experience with Korean documents though is that I encounter a lot of UHC-encoded files, probably because of predominance of Microsoft operating systems, which spread this encoding. So until we get 2 separate detection machines, let's just return EUC-KR files as being "UHC".	2016-09-19 01:22:45 +02:00
Jehan	771d78b7df	Update the URL links: uchardet is now a freedesktop project.	2016-07-20 01:47:50 +02:00
Jehan	210e52d99a	LangModels: update the Greek language models. I did this to improve the model after a user reported a Greek subtitle badly detected (see commit e0eec3b). It didn't help, but well... since I updated it with much more data from Wikipedia. Let's just commit it!	2016-05-25 17:39:10 +02:00
Jehan	e0eec3bae8	src: give a little weight to "probable sequences". Up to now, we were only considering positive sequences, which are sequences of 2 characters which happen the most. Yet our data gather 4 categories of sequences (the last one being called "negative", since they never happened in our data). I will call the category below positive: probable sequences. They may happen, yet not often. The last category could be called "neutral". This seems to fix the detection of a user's subtitle example without breaking any of our current unit tests. Probably I should still review this whole logics more in details later.	2016-05-25 17:38:20 +02:00
Jehan	4287d3accc	src: trailing whitespace removed.	2016-05-25 16:07:17 +02:00
Ilya Tumaykin	2a3e41a6c3	cmake: drop useless PACKAGE_NAME redefinition	2016-03-22 01:23:06 +03:00
Ilya Tumaykin	6db8b6f8fe	cmake: minor comment cleanups	2016-03-22 01:23:06 +03:00
Ilya Tumaykin	d0e7ddd8ab	cmake: fix library filename and SONAME Make library filename respect the current uchardet version and make library SONAME respect the current major version.	2016-03-22 01:23:05 +03:00
Ilya Tumaykin	ad647d2e0a	cmake: keep compiler definitions in one place	2016-03-22 01:23:05 +03:00
Ilya Tumaykin	29f18210b1	cmake: hardcode less	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	7201835c98	cmake: export UCHARDET_LIBRARY to the topmost scope	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	e7feb35627	cmake: rename UCHARDET_STATIC_{TARGET -> LIBRARY} for clarity	2016-03-22 01:23:04 +03:00
Ilya Tumaykin	1a1f4bfbd8	cmake: rename UCHARDET_{TARGET -> LIBRARY} for clarity	2016-03-22 01:23:03 +03:00
Ilya Tumaykin	31a53570d6	cmake: use GNUInstallDirs cmake module Available in cmake >= 2.8.5.	2016-03-22 01:23:03 +03:00
Ilya Tumaykin	b44be77be6	cmake: uniform indent everywhere Indent with tabs, remove leading/trailing blank lines and spaces.	2016-03-21 01:07:41 +03:00
Ricardo Constantino (:RiCON)	78b55ec9fe	CMake: Fix regression in f53cb8c building in paths with spaces Tested with Ninja and Make in Windows and Archlinux with paths with and without spaces.	2016-03-18 03:37:12 +00:00
Jehan	fcc525a64f	Merge pull request #25 from Coacher/master cmake: purge remnants of opencc after b6d872bb	2016-03-17 19:10:39 +01:00
Jehan	d255184609	Merge pull request #24 from wiiaboo/ab-suite Improving build with more options. Building only static possible, uchardet command line tool build can be disabled, bindir can be customized…	2016-03-17 19:09:30 +01:00
Ricardo Constantino (:RiCON)	86755b1f57	CMake: Don't build static more than once	2016-03-16 19:31:00 +00:00
Ricardo Constantino (:RiCON)	b908b689a0	CMake: Add static lib destination to UCHARDET_TARGET	2016-03-16 19:30:54 +00:00
Ricardo Constantino (:RiCON)	81ed86a26b	CMake: Use only CMAKE_INSTALL_BINDIR instead of DIR_BIN This way it always shows up in ccmake, even if not defined. A string is used instead of path because I personally think it makes more sense in the following use-cases: STRING: -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins installs everything to /home/user/{lib,etc,share,(...)} and executables to ${CMAKE_INSTALL_PREFIX}/bins -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin everything to /home/user/{lib,etc,share,(...)} and executables to /opt/bin PATH: -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins everything to /home/user/{lib,etc,share,(...)} and executables to $(pwd)/bins (!) -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin same as STRING	2016-03-16 19:11:33 +00:00
Ilya Tumaykin	aa4c2aeada	cmake: purge remnants of opencc after b6d872bb	2016-03-16 19:43:58 +03:00
Ricardo Constantino (:RiCON)	50b2e0802f	CMake: Allow not building executable	2016-03-16 14:34:03 +00:00
Ricardo Constantino (:RiCON)	6500f09931	CMake: Allow building static-only builds Add stdc++ to static libs in pkg-config	2016-03-16 14:30:15 +00:00
Ricardo Constantino (:RiCON)	f53cb8cddd	CMake: fix linking with Ninja	2016-03-16 14:17:47 +00:00
Jehan	923d264470	LangModels: add Danish support (Windows-1252, ISO-8859-1 and ISO-8859-15). Test for ISO-8859-1 is disabled for now since the difference is not big enough, as for characters used in Danish, between ISO-8859-1 and ISO-8859-15. Therefore the first to be declared "wins". Let's see to improve this later. Test contents from: https://da.wikipedia.org/wiki/Eurosymbol https://da.wikipedia.org/wiki/Dansk_%28sprog%29	2016-02-19 19:10:41 +01:00
Jehan	98b5e52252	LangModels: add VISCII encoding support and retrain Vietnamese model.	2016-02-13 03:51:18 +01:00
Jehan	178c6119b8	LangModels: add Windows-1258 support for Vietnamese. I was planning on adding VISCII support as well, but Python encode() method does not have any support for it apparently, so I cannot generate the proper statistics data with the current version of the string.	2016-02-13 02:32:57 +01:00
Jehan	248d6dbd35	tools: exit with non-zero value on uchardet error.	2016-01-21 18:16:42 +01:00
Jehan	9c3c37517c	LangModels: add Arabic support. Models constructed for ISO-8859-6 and Windows-1256.	2015-12-13 18:42:16 +01:00
Jehan	ad2f7212e2	LangModels: retraining Greek models with my training script. This fixes our Greek/Windows-1253 test.	2015-12-13 18:02:11 +01:00
Jehan	ffabb65712	LangModels: adding Spanish support. With 3 charsets: ISO-8859-1, ISO-8859-15 and Windows-1252.	2015-12-12 18:54:35 +01:00
Jehan	a251753db8	LangModels: updating Hungarian language models.	2015-12-12 18:06:17 +01:00
Jehan	4c8316f9cf	Nearly-ASCII text with NBSP is still not ASCII. There is no "exception" in encoding. The non-breaking space 0xA0 is not ASCII, and therefore returning "ASCII" will later create issues (for instance trying to re-encode with iconv produces an error). This was obviously an explicit decision in original code (according to code comments), probably tied to specificity of the original program from Mozilla. Now we want strict detection. I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only exception" (note that I could have returned any ISO-8859 charsets since they all have this character in common).	2015-12-05 21:11:29 +01:00
Jehan	e5234d6b61	Stating endianness of UTF-16 and UTF-32 was an error when BOM present. According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text MUST NOT prepend a BOM to the text." Since uchardet cannot (and should not, obviously, it's not its role) modify input text, when a BOM is present, we should always label the encoding as "UTF-16" only. Also it broke unit tests in using programs since a conversion from UTF-8 to UTF-16LE/BE would create a text without BOM, and a conversion from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed existing behaviours. Same goes for UTF-32. See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in particular).	2015-12-04 19:19:39 +01:00
Jehan	5691dc59a1	LangModels: rename Cyrillic models to Russian models. Our language models are per-lang, not per script.	2015-12-04 03:27:29 +01:00
Jehan	fb3c47a073	LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models. ISO-8859-11 is basically exactly identical to TIS-620, with the added non-breaking space character. Basically our detection will always return TIS-620 except for exceptional cases when a text has a non-breaking space.	2015-12-04 03:14:52 +01:00
Jehan	5ee1c3ee39	LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9.	2015-12-04 02:35:09 +01:00
Jehan	f0e122b506	LangModels: add Esperanto ISO-8859-3 language model.	2015-12-04 01:35:56 +01:00
Jehan	55b4f23971	Single Byte charsets: high ctrl character ratio lowers confidence. Control characters are not an error per-se. Nevertheless they are clearly not frequent in single-byte charset texts. It is only normal for them to lower confidence in a charset. In particular a higher ctrl-per-letter ratio means a lower confidence. This fixes for instance our Windows-1252 German test (otherwise detected as ISO-8859-1).	2015-12-04 00:04:43 +01:00
Jehan	aa587a64bd	LangModels: adding German models for ISO-8859-1 and Windows-1252.	2015-12-03 23:58:41 +01:00
Jehan	0270b1e856	Adding French Windows-1252 support.	2015-12-03 21:22:30 +01:00
Jehan	ea34e8b1bd	Update doc comment. We do not return empty string on ASCII anymore. It means only detection failure, now. ASCII will get a proper "ASCII" return.	2015-12-03 20:36:09 +01:00
Jehan	ba56d91808	Update uchardet URL in various places.	2015-12-03 19:48:29 +01:00
Jehan	d1bc09e4d7	Update authors. I think I deserved being listed in the authors by now. ;-)	2015-12-03 19:44:13 +01:00
Jehan	c4fa728e7a	Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master Let's shortcut Single Byte charset detection on invalid codepoints. Merging and fixing the contributor's commit conflicts after code redesign: in particular we added an illegal character concept (they were mixed with control characters in current charmaps. Yet ctrl characters are NOT to be considered invalid) and constants instead of hardcoded numbers ('ILL' rather than 255).	2015-12-03 19:26:19 +01:00
Jehan	d686fcc1cd	LangModels: add illegal codepoints information on single byte charmaps.	2015-12-03 19:04:07 +01:00
Jehan	683255278d	Re-enable Hungarian language models. Now that we have at least one model for ISO-8859-1, the risk of detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.	2015-12-02 22:24:36 +01:00
Jehan	4f1c3ff85e	nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars. If all sequences in a text are positive sequences, the ratio of positive sequences cannot make the difference between 2 very close charsets. A ratio of positive sequences per letters on the other hand will change a tie between 2 encoding. If while adding a letter, the number of positive sequences does not increase, the confidence will decrease (corresponding to the fact it was likely not a letter). On the other hand, if the number of positive sequences increase, so will the confidence. For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15. When letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.	2015-11-30 19:52:07 +01:00
Jehan	9cb5764b73	LangModels: update the French language models. Fully built with the script.	2015-11-30 19:20:55 +01:00
Jehan	dbb4c1d2ff	nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE... ... with per-language model "frequent character" count.	2015-11-29 23:51:55 +01:00
Jehan	0289c2a232	Differentiate ASCII and detection failure. The lib used to return "" for both properly detected ASCII and detection failure. And the tool would return "ascii/unknown". Make a proper distinction between the 2 cases.	2015-11-28 17:04:52 +01:00
Jehan	005fd98086	Add initial support for French with ISO-8859-1 and ISO-8859-15. Mostly generated with a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generating script to the main tree.	2015-11-28 02:14:39 +01:00
Jehan	2106173546	Move all Single-Byte language models to a subdirectory.	2015-11-27 23:11:23 +01:00
Jehan	984d8f7b09	Add language information in model names when they were missing. Models are language specific (there could be several models for the same charset but different languages). Let's have a clear naming scheme.	2015-11-27 18:21:13 +01:00
Jehan	42b91898da	Create 3-letter constants for special charmap characters. Control characters, carriage, symbols and numbers. Also add a constant for illegal characters (not used for now). This will allow easier processing and charmap reading.	2015-11-27 17:41:54 +01:00

176 Commits