Now the UTF-8 prober not only detects valid UTF-8, but also detects the
most probable language. Using the data generated two commits ago, this
works very well.
This is still basic and will require more improvements. In particular,
nsUTF8Prober should now return an array of ("UTF-8", language) candidate
couples, and nsMBCSGroupProber should itself forward these candidates, as
well as the candidates from the other multi-byte detectors. This way, the
public-facing API would get more probable candidates, in case the
algorithm is slightly wrong.
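As a rough illustration of the direction (the names and types below are
hypothetical, not the actual uchardet classes), such a candidate couple and
the merging done by a group prober could look like this:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Hypothetical (charset, language) candidate couple with its confidence.
    struct CandidateCouple
    {
      std::string charset;    // e.g. "UTF-8"
      std::string language;   // ISO 639-1 code, e.g. "fr"; empty if unknown
      float       confidence; // 0.0 .. 1.0
    };

    // A group prober could merge the candidates of all its sub-probers and
    // sort them by decreasing confidence before handing them to the API.
    std::vector<CandidateCouple>
    MergeCandidates(const std::vector<std::vector<CandidateCouple> >& perProber)
    {
      std::vector<CandidateCouple> all;
      for (const std::vector<CandidateCouple>& list : perProber)
        all.insert(all.end(), list.begin(), list.end());
      std::sort(all.begin(), all.end(),
                [](const CandidateCouple& a, const CandidateCouple& b)
                { return a.confidence > b.confidence; });
      return all;
    }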
Also, the UTF-8 confidence is currently stupidly high as soon as we
consider the encoding to be right. We should likely weigh it with language
detection (in particular, if no language is detected, this should severely
weigh down the UTF-8 confidence; not to 0, but high enough to remain a
fallback in case no other encoding+language couple is valid, and low
enough to give other good candidate couples a chance).
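For instance, the weighting could take a form along these lines (a sketch
only; the factors are made up and would need tuning against real data):

    // Hypothetical weighting of the raw UTF-8 validity confidence by the
    // confidence of the language detection; constants are illustrative only.
    float WeightUTF8Confidence(float utf8Validity, float languageConfidence)
    {
      if (languageConfidence <= 0.0f)
        // No language detected: keep UTF-8 as a plausible fallback, but low
        // enough that a valid encoding+language couple can win over it.
        return utf8Validity * 0.3f;

      return utf8Validity * (0.5f + 0.5f * languageConfidence);
    }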
This doesn't work for all probers yet, in particular not for the most
generic probers (such as the UTF-8 or WINDOWS-1252 ones). These will
return NULL for the language.
It's still a good first step.
Right now, it returns the 2-character language code from ISO 639-1. A
project using the library could easily get the English language name from
the XML/JSON files provided by the iso-codes project. That project also
makes it easy to localize the language names in other languages through
gettext (this is what we do in GIMP for instance). I don't add any
dependency though, and leave it to downstream projects to implement this.
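As an example of what is left to downstream projects, resolving the
returned code to a displayable name could be as simple as the following
sketch (a real project would read the iso-codes XML/JSON data, and possibly
translate the result through gettext, instead of hard-coding a table):

    #include <map>
    #include <string>

    // Hypothetical lookup of an English display name for the ISO 639-1 code
    // returned by uchardet; the tiny hard-coded table only shows the idea.
    std::string LanguageDisplayName(const std::string& iso639_1)
    {
      static const std::map<std::string, std::string> names = {
        { "fr", "French"  },
        { "de", "German"  },
        { "zh", "Chinese" },
      };
      auto it = names.find(iso639_1);
      return it != names.end() ? it->second : iso639_1;
    }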
I was also wondering whether we want to support region information for
cases where it would make sense. I especially wondered about it for
Chinese encodings, as some of them seem quite specific to a region
(according to Wikipedia at least). For the time being though, these just
return "zh". We'll see later whether it makes sense to be more accurate
(maybe depending on reports?).
Preparing for an updated API which will also allow looking at the
confidence value, as well as getting the list of possible candidates (i.e.
all detected encodings whose confidence value was high enough that we
would even consider them).
It is still only internal logic though.
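The updated API could eventually expose something along these lines (these
declarations are purely hypothetical and do not exist in the public header
yet; the uchardet_t handle is assumed to come from <uchardet/uchardet.h>):

    #include <stddef.h>
    #include <uchardet/uchardet.h>  /* for the uchardet_t handle */

    /* Hypothetical additions to the public C API, only illustrating the
     * direction described above; none of these functions exist yet. */
    size_t       uchardet_get_n_candidates(uchardet_t ud);
    const char * uchardet_get_candidate_encoding(uchardet_t ud, size_t i);
    const char * uchardet_get_candidate_language(uchardet_t ud, size_t i);
    float        uchardet_get_candidate_confidence(uchardet_t ud, size_t i);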
ASCII and ISO-8859-1 should not be detected in
nsUniversalDetector::HandleData() but in nsUniversalDetector::DataEnd()
instead. Otherwise it creates an unwanted shortcut at the first call to
uchardet_handle_data() when the input is broken into several pieces and
the first chunk happens to be ASCII (or ASCII + NBSP).
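Schematically, the idea is the following (a standalone sketch, not the
actual nsUniversalDetector code):

    #include <cstddef>
    #include <string>

    // Minimal sketch: remember whether any non-ASCII byte was seen while the
    // data arrives in chunks, and only conclude "ASCII" once all data was fed.
    struct AsciiTracker
    {
      bool nonAscii = false;

      void HandleData(const char* buf, std::size_t len)
      {
        for (std::size_t i = 0; i < len; i++)
          if (static_cast<unsigned char>(buf[i]) & 0x80)
            nonAscii = true;
        // Do NOT conclude "ASCII" here: a later chunk may still contain
        // non-ASCII bytes.
      }

      std::string DataEnd() const
      {
        return nonAscii ? std::string() : std::string("ASCII");
      }
    };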
I am not sure whether it is useful to have the 'if (mDetectedCharset)'
outside the if block, but it certainly won't hurt in this specific case,
so I leave the current code logic as is.
The exact warning was:
nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’:
nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
if (aLen > 2)
^~
nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
if (mDetectedCharset)
^~
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according to
code comments), probably tied to the specifics of the original program
from Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for nearly-ASCII texts with NBSP as the only
exception (note that I could have returned any of the ISO-8859 charsets
since they all have this character in common).
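The end-of-data decision then schematically becomes (a sketch with made-up
names, not the actual code):

    #include <cstddef>

    // Sketch: report "ASCII" for pure ASCII, "ISO-8859-1" when the only
    // non-ASCII byte seen is the non-breaking space 0xA0, and defer to the
    // regular probers otherwise.
    const char* ClassifyTrivialCharsets(const unsigned char* buf, std::size_t len)
    {
      bool sawNbsp = false;
      for (std::size_t i = 0; i < len; i++)
      {
        if (buf[i] == 0xA0)
          sawNbsp = true;
        else if (buf[i] & 0x80)
          return nullptr;  // real non-ASCII data: let the probers decide
      }
      return sawNbsp ? "ISO-8859-1" : "ASCII";
    }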
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
Also, it broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE would create a text without a BOM, while a
conversion from UTF-16LE/BE to UTF-8 would create a UTF-8 text with a BOM,
which changed existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
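For reference, here is a standalone sketch of the BOM patterns and the
resulting generic labels (the byte sequences come from RFC 2781 and the
Unicode tables cited above; the function itself is only illustrative):

    #include <cstddef>

    // Sketch: when a BOM is present, report the generic "UTF-32"/"UTF-16"
    // name instead of an endianness-specific one, since the BOM already
    // carries the byte-order information. UTF-32 must be tested first
    // because the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM.
    const char* SniffUtfBom(const unsigned char* buf, std::size_t len)
    {
      if (len >= 4 &&
          ((buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) ||
           (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)))
        return "UTF-32";
      if (len >= 2 &&
          ((buf[0] == 0xFE && buf[1] == 0xFF) ||
           (buf[0] == 0xFF && buf[1] == 0xFE)))
        return "UTF-16";
      return nullptr;  // no UTF-16/32 BOM found
    }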
The library used to return "" both for properly detected ASCII and for
detection failure, while the tool would return "ascii/unknown".
Make a proper distinction between the two cases.
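From a caller's point of view, the distinction can then be made on the
returned string (a usage sketch against the public C API, assuming the
usual <uchardet/uchardet.h> header; error handling omitted):

    #include <cstdio>
    #include <uchardet/uchardet.h>

    int main(void)
    {
      const char text[] = "plain ASCII text";

      uchardet_t ud = uchardet_new();
      uchardet_handle_data(ud, text, sizeof(text) - 1);
      uchardet_data_end(ud);

      const char* charset = uchardet_get_charset(ud);
      if (charset == nullptr || charset[0] == '\0')
        std::printf("detection failed\n");
      else
        std::printf("detected: %s\n", charset);  /* expected: ASCII */

      uchardet_delete(ud);
      return 0;
    }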