6 Commits

Author SHA1 Message Date
Jehan
cec8817d79 src: new Big5 detection implementation.
Instead of using a huge frequency table through some state machine code
that I don't even understand, this relies on the fact that the Big5
encoding is organized from the start into tables of frequent and
non-frequent characters (per the Wikipedia page on Big5). This makes it
very easy to gather statistics by just counting which class each
character is in.
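
The counting idea can be sketched as follows. This is not the code from
this commit: the names (Big5Class, ClassifyBig5, CountBig5) are
hypothetical, and the ranges are the ones listed on the Wikipedia page on
Big5 (0xA440-0xC67E for frequently used characters, 0xC940-0xF9D5 for
less frequently used ones).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical classes based on the Big5 layout described on Wikipedia.
enum class Big5Class { Frequent, Rare, Other, Invalid };

// Classify one two-byte Big5 sequence by the range its code falls in.
Big5Class ClassifyBig5(std::uint8_t lead, std::uint8_t trail)
{
  const bool validTrail = (trail >= 0x40 && trail <= 0x7E) ||
                          (trail >= 0xA1 && trail <= 0xFE);
  if (lead < 0x81 || lead > 0xFE || !validTrail)
    return Big5Class::Invalid;

  const unsigned code = (static_cast<unsigned>(lead) << 8) | trail;
  if (code >= 0xA440 && code <= 0xC67E) return Big5Class::Frequent;
  if (code >= 0xC940 && code <= 0xF9D5) return Big5Class::Rare;
  return Big5Class::Other;  // symbols, reserved and user-defined areas
}

struct Big5Counts {
  std::size_t frequent = 0;
  std::size_t rare = 0;
  std::size_t other = 0;
  std::size_t invalid = 0;
};

// Walk a byte buffer and count how many characters fall in each class.
Big5Counts CountBig5(const std::string& data)
{
  Big5Counts counts;
  for (std::size_t i = 0; i < data.size(); ++i) {
    const auto lead = static_cast<std::uint8_t>(data[i]);
    if (lead < 0x80)
      continue;  // single-byte ASCII, not counted
    if (i + 1 == data.size()) {
      ++counts.invalid;  // truncated two-byte sequence
      break;
    }
    const auto trail = static_cast<std::uint8_t>(data[i + 1]);
    switch (ClassifyBig5(lead, trail)) {
      case Big5Class::Frequent: ++counts.frequent; ++i; break;
      case Big5Class::Rare:     ++counts.rare;     ++i; break;
      case Big5Class::Other:    ++counts.other;    ++i; break;
      case Big5Class::Invalid:  ++counts.invalid;       break;
    }
  }
  return counts;
}
```

A prober built on such counts could, for instance, derive a confidence
from the share of frequent characters and penalize invalid sequences.
How the actual commit weighs these counts is not stated in this message.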

In a few tests with random Chinese text converted to Big5, it seems
to work pretty well (and it fixes the test which was broken by the
previous commit), and it doesn't slow down detection in any significant
way either.

This may be the next step towards also improving detection of the
various other multi-byte encodings, which still use generated coding
state machines that mostly still elude me.
2022-12-19 00:01:12 +01:00
LSY
d72a5c88ce add charset prober for Johab Korean 2022-12-14 00:23:13 +01:00
Jehan
2a16ab2310 src: nsEscCharsetProber also returns the correct language.
nsEscCharsetProber will still only return a single candidate, because
the encoding is detected by a state machine, not by language statistics.
It now also returns the language attached to the encoding.
2022-12-14 00:23:13 +01:00
Jehan
53f7ad0e0b Bug 101032 - assignments to nsSMState in nsCodingStateMachine result...
... in unspecified behavior.
When compiling with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states depend on each charset's own state machine,
it is not possible to simply extend the enum with more generic
values. Instead, let's just make the state an unsigned int value and
define the 3 generic states as constants.
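
A minimal sketch of that approach, assuming a plain integer alias and
constexpr constants. The names here (SMState, kStart, kError, kItsMe,
kSawEscape, NextState) are hypothetical and the toy machine is only for
illustration; the actual patch may spell the type and constants
differently.

```cpp
#include <cstdint>

// The state is a plain unsigned integer, so charset-specific machines
// can use any value without leaving the valid range of an enum type.
using SMState = std::uint32_t;

// The three generic states shared by every machine (illustrative values).
constexpr SMState kStart = 0;
constexpr SMState kError = 1;
constexpr SMState kItsMe = 2;

// A charset-specific state of a toy machine. With a three-value enum,
// loading such a value is what UBSan reports as invalid.
constexpr SMState kSawEscape = 3;

// Toy transition function: accepts the byte 0x1B followed by '$'.
SMState NextState(SMState state, unsigned char byte)
{
  switch (state) {
    case kStart:     return byte == 0x1B ? kSawEscape : kStart;
    case kSawEscape: return byte == '$' ? kItsMe : kError;
    default:         return state;  // kError and kItsMe are terminal
  }
}
```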
2017-05-28 20:01:06 +02:00
BYVoid
84284eccf4 Update code from upstream. 2011-07-11 14:42:50 +08:00
BYVoid
3601900164 Initial release. 2011-07-10 15:04:42 +08:00