Rather than using a huge frequency table driven by some state machine
code that I don't even understand, I noticed that the Big5 encoding is
organized from the start into frequent and less-frequent character
ranges (per the Wikipedia page on Big5). This makes it very easy to
gather character statistics by simply checking which class each
character falls into.
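A minimal sketch of the classification, assuming the range boundaries
given by the Wikipedia article (frequently used characters in
0xA440..0xC67E, less frequent ones in 0xC940..0xF9D5); the function
and enum names are illustrative, not uchardet's actual API:

    enum class Big5Class { Frequent, Rare, Other };

    Big5Class ClassifyBig5(unsigned char lead, unsigned char trail)
    {
        // Valid Big5 trail bytes are 0x40..0x7E and 0xA1..0xFE.
        bool validTrail = (trail >= 0x40 && trail <= 0x7E) ||
                          (trail >= 0xA1 && trail <= 0xFE);
        if (!validTrail)
            return Big5Class::Other;

        unsigned int code = ((unsigned int)lead << 8) | trail;
        if (code >= 0xA440 && code <= 0xC67E)
            return Big5Class::Frequent;  // frequently used characters
        if (code >= 0xC940 && code <= 0xF9D5)
            return Big5Class::Rare;      // less frequently used characters
        return Big5Class::Other;         // symbols, reserved, user-defined
    }

A prober can then favor Big5 when the ratio of frequent to rare
characters looks like what real Chinese text produces.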
In a few tests with random Chinese text converted to Big5, this seems
to work pretty well (and fixes the test which was broken by the
previous commit), and it doesn't slow down detection in any
significant way either.
This may also be the next step towards improving the various other
multi-byte encoding detectors, which still rely on generated coding
state machines that mostly elude me.
nsEscCharsetProber will still return only a single candidate, since
detection here is done by a state machine rather than language
statistics. It will now also return the language attached to the
encoding.
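As an illustration only (the type and field names here are
hypothetical, not the actual uchardet interface), each candidate now
conceptually pairs the two pieces of information:

    // Hypothetical shape of a detection candidate: the charset
    // name together with the language the prober attaches to it.
    struct Candidate {
        const char* encoding;  // e.g. "ISO-2022-JP"
        const char* language;  // e.g. "ja"
    };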
... in unspecified behavior.
When compiled with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states depend on each charset's own state machine,
it is not possible to simply extend the enum with more generic values.
Instead, let's make the state an unsigned int and define the 3 generic
states as constants.
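A sketch of what this amounts to, assuming the three generic states
are eStart, eError and eItsMe as in the existing code (the exact
declaration style in the tree may differ):

    /* Before: a plain enum, so loading a charset-specific state
       such as 5 into an nsSMState is undefined behavior. */
    /* typedef enum { eStart = 0, eError = 1, eItsMe = 2 } nsSMState; */

    /* After: the state is just an unsigned int; the 3 generic
       states become constants, and charset-specific machines can
       use any larger values without leaving the valid range. */
    typedef unsigned int nsSMState;
    static const nsSMState eStart = 0;
    static const nsSMState eError = 1;
    static const nsSMState eItsMe = 2;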